Data Science Ai
Jens Flemming
2 Guided Reading 15
2.1 Data Science I (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Data Science II (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Data Science III (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Data Science IV (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II Warm-Up 21
3 Data Science, AI, Machine Learning 23
3.1 Science With and Of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Example: Customer Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Example: Weather Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1 Names and Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.4 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9 Strings 105
9.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2 Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.3 String Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11 Functions 127
11.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
11.2 Passing Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
11.3 Anonymous Functions (Lambdas) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.4 Function and Method Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.5 Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
14 Inheritance 143
14.1 Principles of Object-Oriented Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14.2 Idea and Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
14.4 Type Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
14.5 Every Class is a Subclass of object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.6 Virtual Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.7 Multiple Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.8 Exceptions Inherit from Exception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.4 The set Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.5 Function Decorators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.6 The copy Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.7 Multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
15.8 Graphical User Interfaces (GUIs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
V Exercises 237
19 Computer Basics 239
19.1 Bits and Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
19.2 Representation of Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
19.3 Memory vs. Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
19.4 Compilers and Interpreters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
21.3 Pandas Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
21.4 Pandas Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.5 Advanced Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
21.6 Pandas Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
VI Projects 283
22 Install and Use Python 285
22.1 Working with JupyterLab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
22.2 Install Jupyter Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
22.3 Python Without Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
22.4 Long-Running Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
24 Weather 305
24.1 DWD Open Data Portal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
24.2 Getting Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
24.3 Climate Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
26 Cafeteria 315
26.1 The API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
26.2 Legal Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
26.3 Getting Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
26.4 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
30 Combinatorics 331
30.1 Factorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
VIII Software Development 337
32 Unified Modeling Language (UML) 339
Data Science and Artificial Intelligence for Undergraduates
This book covers a wide range of topics in data science and artificial intelligence. It’s an attempt to provide self-contained learning material for first-year students in data science related courses. Most, but not all, of the material is taught in the undergraduate course on data science1 at Zwickau University of Applied Sciences2.
When he started teaching data science in 2019, the author3 faced the problem that there seemed to be no textbook covering math, computer science, statistical data science, artificial intelligence, and related topics in a well-structured, accessible, thorough way. Basic Python4 programming should be covered as well as state-of-the-art deep reinforcement learning for controlling autonomous robots, all with hands-on experience for students, interesting real-world data sets, and a sufficiently rich theoretical background.
Classical paper books or PDF ebooks do not suit the needs of this project. Working with data requires lots of source code, interactive visualizations, data listings, and easy-to-follow pointers to online resources. Jupyter Book5 is an awesome software tool for publishing book-like interactive content. For the author, writing this book is also a journey of discovery into the possible future of publishing. Having authored two paper books, he knows the tight limits of paper books and publishing companies, and all the greater is his enthusiasm for the freedom in writing and publishing provided by Jupyter Book and its community, The Executable Books Project6.
The author expresses his gratitude to all the more or less anonymous people developing the wonderful open source tools used in and for writing this book. There are too many tools to list them here. The author also thanks his students and colleagues at Zwickau University, especially Hendrik Weiß, who constantly find typos and make suggestions for improving the book.
Jens Flemming7, Zwickau, December 2022
1 https://datascience.fh-zwickau.de
2 https://www.fh-zwickau.de
3 https://www.fh-zwickau.de/~jef19jdw
4 https://www.python.org
5 https://jupyterbook.org
6 https://executablebooks.org
7 https://www.fh-zwickau.de/~jef19jdw
Part I
CHAPTER ONE
AN EXECUTABLE BOOK
This book provides several online features for executing presented Python code and for contributing content. This
chapter contains all information you need for using these features.
• Read Online or Offline (page 5)
• Manipulate and Execute Code (page 6)
• Contribute (page 9)
This book has been created with Jupyter Book8 . It comes in different formats and variants. The reader may choose
according to her or his preferences. The reader may even switch between different variants at will.
The intended medium for reading this book is a website in a web browser, that is, the HTML rendering of the book9. There you have full functionality, including interactive features and live code editing and execution; see Manipulate and Execute Code (page 6) for details.
The HTML rendering comes with an optional fullscreen mode (button in the upper right corner). You may also hide
the table of contents sidebar on small screens (button in the upper left corner).
Fig. 1.1: Fullscreen button and sidebar button allow you to adjust the book’s layout to your screen.
8 https://jupyterbook.org
9 https://www.fh-zwickau.de/~jef19jdw/data-science-ai
You may download the whole HTML rendering as a ZIP archive. After extracting the archive, open the file index.html in a web browser. All features will work like in the online version, but some content, like externally hosted videos, won’t be available without an internet connection.
The HTML rendering is a static website, that is, no web server is needed for reading. Everything works on your local machine.
Download HTML rendering in ZIP file10
For printing and for friends of higher quality typesetting there is a PDF version of the book. Of course, the PDF
version lacks interactive features like in-place code editing and execution.
Download PDF ebook11
Python code in this book can be executed in different ways without copying the code manually. The HTML rendering’s
upper right corner shows a rocket symbol. The rocket button provides several options for executing a page’s code.
Fig. 1.2: Hovering over the rocket symbol provides several options for code execution.
The next section contains some Python code for testing code execution right here on this page. Subsequent sections
describe button functionality in more detail. Local code execution on your machine is described, too.
Attention: All code execution features but Live Code use the book’s Jupyter12 rendering. For technical reasons, the Jupyter rendering lacks some figures, and text formatting may be incorrect. For reading without a need for code execution, stay with the HTML or PDF renderings.
10 https://www.fh-zwickau.de/~jef19jdw/data-science-ai/data-science-ai.zip
11 https://www.fh-zwickau.de/~jef19jdw/data-science-ai/data-science-ai.pdf
12 https://jupyter.org
Here we have some simple Python code for testing code execution features of this executable book. Details on these
features are given below.
a = 2
b = 6
print(a, '+', b, '=', a + b)
2 + 6 = 8
The Binder launch button opens a JupyterLab13 session on mybinder.org14. There you find the book’s Jupyter rendering. The Jupyter rendering is a collection of Jupyter Notebooks (files with ipynb extension). The Binder launch button opens your current page’s Jupyter rendering, but all other pages are available, too, in one and the same Binder session.
Fig. 1.3: Binder startup requires cloning the book’s Git repository if something has changed since last Binder usage.
Starting the Binder session may take some seconds. Keep in mind that mybinder.org15 is a free service provided by volunteers and supported by donors. Don’t overuse it, so it stays free and available to everybody. Don’t run complex computations like neural network training on Binder.
The JupyterLab session on Binder allows for code editing and repeated execution. You may also save your files there,
but they will be lost as soon as you end the session. Don’t forget to download modified files to your local machine
before you leave.
13 https://jupyter.org
14 https://mybinder.org
15 https://mybinder.org
Gauss16 is a GPU server at Zwickau University of Applied Sciences, available only in the university’s intranet.
Students with access to Gauss should use Gauss instead of Binder. The Gauss launch button runs the book’s Jupyter
rendering in JupyterLab on Gauss very similar to Binder.
Fig. 1.4: Gauss asks for username and password before launching a book’s page in JupyterLab.
A click on the Gauss launch button copies the whole GitLab repository17 of the book to the user’s personal directory on
Gauss. Thus, modifications to code and other files are saved to the user’s directory, too, and are persistent. Repeated
clicks on the Gauss launch button do not overwrite a user’s modifications, but may update files untouched by the user
but modified in the GitLab repository. Thus, the user’s version will always be up-to-date while preserving the user’s
modifications as far as possible. For details on the merge process run when clicking the Gauss launch button see
Automatic Merging Behavior18 in nbgitpuller’s documentation.
The Live Code button makes code cells editable and executable on-the-spot using Thebe19 . Clicking the Live Code
button starts a Python kernel on mybinder.org20 and connects the book’s HTML rendering to that kernel. Progress
and success of the startup process are shown below the page’s heading.
Fig. 1.5: A box with progress information appears after clicking the Live Code button.
Code cells on the page change their appearance. Outputs now belong to the cell, some buttons appear, and the code
becomes editable.
Cells are not run immediately after clicking the Live Code button. Clicking the ‘run’ button executes the cell. Alternatively, one may run all cells on a page by clicking the ‘restart & run all’ button.
16 https://gauss.fh-zwickau.de
17 https://gitlab.hrz.tu-chemnitz.de/jef19jdw--fh-zwickau.de/data-science-ai
18 https://jupyterhub.github.io/nbgitpuller/topic/automatic-merging.html
19 https://github.com/executablebooks/thebe
20 https://mybinder.org
Fig. 1.6: After launching Live Code each code cell shows buttons for starting and controlling code execution.
To execute the book’s Python code on your local machine, download a page’s Jupyter rendering by clicking the download button in the upper right corner of the HTML rendering, or clone the book’s Git repository21 to your machine.
Fig. 1.7: Hovering over the download symbol shows a list of available formats.
On your machine you need JupyterLab22 or a similar tool from the Jupyter ecosystem to view and modify the ipynb
files. For install instructions have a look at the Install Jupyter Locally (page 289) project.
1.3 Contribute
This book is not static. It will grow over time, and existing material will be rearranged, updated, reduced, or extended as required by future developments in data science, AI, and teaching. In this sense, it’s a living or dynamic book. And you, the reader, are invited to contribute to the book.
The book’s HTML rendering shows a contribution button in the upper right corner offering several options. Those
options will be discussed in detail below.
Fig. 1.8: Hovering over the contribution button shows all options for contributing to the book.
21 https://gitlab.hrz.tu-chemnitz.de/jef19jdw--fh-zwickau.de/data-science-ai
22 https://jupyter.org
The repository button is a simple link to the book’s GitLab repository. There you find all source files needed to render
the book. If you are familiar with Git and GitLab the repository button is a good starting point for contributing to
the project. If you are not familiar with Git and GitLab, don’t worry and use one of the other contribution options.
Fig. 1.9: The repository button leads to a page in GitLab with information about the project, including information
about the most recent update.
Important: The book’s public Git repository is hosted on a GitLab instance provided by Chemnitz University
of Technology23 for all Saxon universities. Actions requiring a user account are restricted to members of Saxon
universities.
The open issue button allows you to ask questions or report bugs, typos, and so on. Clicking this button opens a new issue in the book’s GitLab repository. Put your question, bug report, or whatever else in the description field and click the ‘Create issue’ button.
The author will have a look at the issue as soon as possible and post an answer. Depending on your GitLab account’s configuration, you will receive email notifications if someone posts a comment on your issue. Each reader may comment on other readers’ issues, so readers may help each other solve problems where appropriate.
If you spot a typo or if you would like to add some explanatory note or an additional code example you should hit
the ‘suggest edit’ button. The button opens the current page for editing in GitLab. The syntax of the text file is
MyST markdown24 and should be self-explanatory. If your edit contains more than a typo correction, leave a commit
message summarizing your edit. Then click ‘Commit changes’.
On the next page click ‘Create merge request’. This will send a notification to the author that somebody suggested an
edit.
The author will have a look at your suggestion as soon as possible. GitLab’s merge request feature allows for discussing
and modifying the edit if necessary before the author merges the edit into the book’s source and, thus, into the
published book.
23 https://www.tu-chemnitz.de
24 https://myst-parser.readthedocs.io
Fig. 1.10: To open an issue simply type a description and click on ‘Create issue’. The title field is prefilled and should
remain untouched.
Fig. 1.11: Edit the page’s markdown source and leave a commit message to describe more elaborate edits.
Fig. 1.12: Simply click ‘Create merge request’. Fill the description field if you feel a need for additional explanation
next to your commit message.
If you don’t have an account at the book’s GitLab instance feel free to send issues and suggestions for edits to the
author25 by email.
25 https://www.fh-zwickau.de/~jef19jdw
CHAPTER TWO
GUIDED READING
The teaching material in this book, including exercises and projects, may be arranged in different ways to meet the
needs of a comprehensive lecture series or to accompany a one-day workshop on machine learning, for instance.
In each section the chapter presents a selection of material together with hints on working through it. Students of
Zwickau University’s data science course will find the material for each semester here.
• Data Science I (course at WHZ) (page 15)
• Data Science II (course at WHZ) (page 20)
• Data Science III (course at WHZ) (page 20)
• Data Science IV (course at WHZ) (page 20)
The first part of the data science lecture series introduces the Python programming language and some Python libraries
required for data processing. Next to Python the focus is on working with big data, obtaining, understanding and
restructuring data, as well as extracting basic statistical information from data.
2.1.1 Warm-Up
Week 1
Lectures
• Data Science, AI, Machine Learning (page 23)
• Python and Jupyter (page 27)
• Computers and Programming (page 33)
Self-study
• Computer Basics (page 239) (exercises)
• Install and Use Python (page 285)
– Working with JupyterLab (page 285) (project)
Practice session
• Install and Use Python (page 285)
– Install Jupyter Locally (page 289) (project)
Lectures
• Crash Course (page 43)
– A First Python Program (page 43)
– Building Blocks (page 46)
Self-study
• Finding Errors (page 243) (exercises)
• Basics (page 249) (exercises)
Practice session
• Install and Use Python (page 285)
– Python Without Jupyter (page 292) (project)
• Python Programming (page 297)
– Simple List Algorithms (page 297) (project)
Lectures
• Crash Course (page 43), continued
– Screen IO (page 56)
– Library Code (page 57)
– Everything is an Object (page 59)
Self-study
• More Basics (page 251) (exercises, last task is bonus)
Practice session
• Python Programming (page 297)
– Geometric Objects (page 300) (project, last section is bonus)
Lectures
• Variables and Operators (page 75)
– Names and Objects (page 75)
– Types (page 80)
– Operators (page 85) (section Operators as Member Functions (page 88) is bonus)
– Efficiency (page 89) (all but section Garbage Collection (page 91) is bonus)
Self-study
• Variables and Operators (page 253) (exercises)
• Memory Management (page 256) (exercises, all but the last two tasks are considered bonus)
Practice session
Lectures
• Lists and Friends (page 93)
– Tuples (page 93)
– Lists (page 95)
– Dictionaries (page 98)
– Iterable Objects (page 98)
• Strings (page 105)
– Basics (page 105)
– Special Characters (page 107)
– String Formatting (page 109)
Self-study
• Lists and Friends (page 259) (exercises)
• Strings (page 262) (exercises, last one is bonus)
Practice session
• MNIST Character Recognition (page 311)
– The xMNIST Family of Data Sets (page 311) (project)
Lectures
• Accessing Data (page 111)
– File IO (page 111)
– Text Files (page 114)
– ZIP Files (page 117)
– CSV Files (page 118)
– HTML Files (page 119)
– XML Files (page 121)
– Web Access (page 122)
Self-study
• File Access (page 264) (exercises)
Practice session
• Cafeteria (page 315), download part (project)
Lectures
• Functions (page 127)
– Basics (page 127)
– Passing Arguments (page 128)
– Anonymous Functions (Lambdas) (page 132)
– Function and Method Objects (page 132)
– Recursion (page 133)
• Modules and Packages (page 135)
Self-study
• Functions (page 265) (exercises)
Practice session
• Cafeteria (page 315), parsing part (project)
Lectures
• Error Handling and Debugging Overview (page 139)
• Inheritance (page 143) (last section Exceptions Inherit from Exception (page 147) is bonus)
• Unified Modeling Language (UML) (page 339) (bonus)
Self-study
• Object-Oriented Programming (page 267) (exercises)
• Further Python Features (page 149) (bonus reading)
Practice session
• Weather (page 305)
– Getting Forecasts (page 307), download part (project)
Lectures
• Efficient Computations with NumPy (page 155)
– NumPy Arrays (page 155)
– Array Operations (page 161)
– Advanced Indexing (page 165)
– Vectorization (page 166)
Self-study
• NumPy Basics (page 269) (exercises)
Practice session
• Weather (page 305)
– Getting Forecasts (page 307), parsing part (project, automatic download is bonus)
Lectures
• Efficient Computations with NumPy (page 155), continued
– Array Manipulation Functions (page 168)
– Copies and Views (page 171)
– Efficiency Considerations (page 173)
– Special Floats (page 175)
– Linear Algebra Functions (page 177)
– Random Numbers (page 179)
• Saving and Loading Non-Standard Data (page 181)
Self-study
• Image Processing with NumPy (page 271) (exercises, last one is bonus)
Practice session
• MNIST Character Recognition (page 311)
– Load QMNIST (page 313) (project)
Lectures
• High-Level Data Management with Pandas (page 187)
– Series (page 188)
– Data Frames (page 198)
Self-study
• Pandas Basics (page 273) (exercises)
Practice session
• Public Transport (page 317)
– Get Data and Set Up the Environment (page 317) (project)
Lectures
• High-Level Data Management with Pandas (page 187), continued
– Advanced Indexing (page 207)
– Dates and Times (page 217)
Self-study
• Pandas Indexing (page 276) (exercises)
Practice session
• Corona Deaths (page 325) (project)
Lectures
• High-Level Data Management with Pandas (page 187), continued
– Categorical Data (page 223)
– Restructuring Data (page 227)
– Performance Issues (page 233)
Self-study
• Advanced Pandas (page 278) (exercises)
• Pandas Vectorization (page 281) (exercises)
Practice session
• Weather (page 305)
– Climate Change (page 309) (project)
The second semester of the data science lecture series starts with visualization techniques. Then supervised machine learning for generating predictions from data is introduced. Linear regression and artificial neural networks are discussed in depth.
to be continued…
Part three of the data science lecture series continues discussion of supervised machine learning. Further methods
like decision trees and support vector machines are introduced. Then we move on to unsupervised machine learning
covering clustering methods and techniques for dimensionality reduction.
to be continued…
The last part of the data science lecture series is devoted to reinforcement learning. Next to very basic techniques we
also discuss state-of-the-art deep reinforcement learning with artificial neural networks.
to be continued…
Part II
Warm-Up
CHAPTER THREE
DATA SCIENCE, AI, MACHINE LEARNING
Data Science comes in different flavors and sometimes denotes different things. Some clarification on the terms used
in this book and on the subjects covered is mandatory.
With the advent of cheap storage devices in the last decade of the 20th century, companies, governments, other organizations, and also private individuals started collecting data at large scale (big data). In a world full of data, somebody has to think about how to make accessible the information hidden in data. Computer scientists and mathematicians developed a bunch of methods for extracting information, more and more applications popped up, methods became more complex… a new field of research was born. This new field matured, got the name ‘data science’, and now is accepted as a serious field of research and teaching.
Data Science as a science field covers all technical aspects of data processing. There’s large overlap with computer sci-
ence and mathematics, but also with many other fields, depending on where data comes from. Mathematics provides
advanced methods for extracting information from data. Computer science allows for their realization.
Data Science also touches law, ethics and sociology. May I use this data set for my project? Is it okay to collect and
dig through personal data? What impact will extensive data collection and processing have on society?
Almost every data science project has four phases:
1. Collect Data
Data has to be recorded and stored somehow. Planning and realizing data collection processes is referred to as data engineering. Typical tasks in this phase are, for instance, installing and configuring sensors, setting up database storage, and implementing techniques for supervising data flow.
2. Clean and Restructure Data
Raw data sets often contain errors, missing items, or false items. They have to be cleaned. Almost always, several data sets have to be combined to allow for successful extraction of information. These preprocessing steps require lots of manual work and domain knowledge. Careful preprocessing will simplify subsequent processing steps and is at least as important as the modeling phase.
3. Create a Model
From recorded and preprocessed data a mathematical or algorithmic model is build. Depending on the concrete
problem to solve from data, such a model may describe the data set (descriptive model) or it may be used to answer
some question based on the data set (predictive model).
4. Communicate Results
Findings from the data have to be communicated to the client. Visualizations are the most important tool for delivering
results.
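The cleaning step of phase 2 can be sketched in a few lines of pandas. This is only a minimal sketch with made-up values and column names; in a real project, domain knowledge is needed to decide what counts as a false item (here, 999.0 is assumed to be a sensor error code):

```python
import numpy as np
import pandas as pd

# Made-up raw data with a missing item (NaN) and a false item (999.0 error code).
raw = pd.DataFrame({
    'temperature': [20.1, np.nan, 19.8, 999.0],
    'humidity': [0.45, 0.50, np.nan, 0.48],
})

# Mark obviously false items as missing, then drop incomplete rows.
cleaned = raw.replace(999.0, np.nan).dropna()
print(cleaned)
```

Whether to drop incomplete rows or to fill the gaps (for instance, by interpolation) is itself a modeling decision that depends on the data set at hand.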
In this book we focus on preprocessing and modeling. Data engineering and communication of results will be touched only occasionally. The visualization aspect of communication also plays an important role in preprocessing when exploring a new data set (exploratory data analysis, EDA for short). So we will cover the full range of visualization tools and techniques there.
Brick-and-mortar stores as well as online shops collect as much customer data as they can to understand customer
behavior. Knowing how many people buy which products at which time in which quantities is essential for efficient
warehousing. But customer data is also used for targeted ad campaigns.
For targeted advertising one tries to identify groups of customers with similar behavior. For each group tailor-made
ads are created. Customer segmentation is an example of descriptive modeling. The aim is to understand the collected
data and to find structures not obvious at first glance.
Typical tasks in the four phases described above are:
1. Collect Data
• implement a network infrastructure to collect sales data from all stores in a central database
• issue customer cards to know who comes to your shop (age, gender, location,…)
• think about buying external data about your customers (Schufa,…)
• check legal situation to know whether you are allowed to collect the data you want
2. Clean and Restructure Data
• throw away all the data not relevant for segmentation (for instance, data of customers not living in the targeted
region)
• transform data (for instance, convert absolute quantities to relative quantities: milk made 5% of the shopping
cart)
• restructure data to get per-customer data instead of per-shop or per-product
3. Create a Model
• apply some standard segmentation method
• try to understand the identified customer groups, find unique characteristics
4. Communicate Results
• present groups and their unique characteristics to the advertising department
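The transformation and restructuring steps of phase 2 can be sketched with pandas. This is a minimal sketch with made-up transaction data; all column names and values are illustrative, not taken from a real shop:

```python
import pandas as pd

# Made-up per-purchase transaction data; column names are illustrative only.
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'product': ['milk', 'bread', 'milk', 'milk', 'butter'],
    'quantity': [2, 1, 1, 3, 1],
})

# Restructure: one row per customer, one column per product (per-customer data).
per_customer = transactions.pivot_table(index='customer_id', columns='product',
                                        values='quantity', aggfunc='sum', fill_value=0)

# Transform absolute quantities to relative ones (shares of the shopping cart).
shares = per_customer.div(per_customer.sum(axis=1), axis=0)
print(shares)
```

The resulting per-customer table of cart shares is the kind of input a standard segmentation method would work on in phase 3.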
Weather forecasting is a typical example of predictive modeling. From past data we want to create a model which yields information on future weather parameters. In the past, lots of experts analyzed recorded weather data and made predictions mainly from experience and from classical mathematical and physical modeling. Data science allows us to automate the forecasting process. Instead of handcrafted models and expert knowledge, one creates a predictive data model based on all (or sufficiently many) recorded weather data.
1. Collect Data
• decide which weather parameters to record (temperature, humidity,…)
• implement a network infrastructure to collect weather data from across the world
• build and launch satellites
• build terrestrial weather stations
2. Clean and Restructure Data
• decide for a subset of data to use for forecasting (for instance, only use data from past 30 days)
• transform data (for instance, harmonize temperature units: Fahrenheit, Celsius)
• restructure data (for instance, downsample data from 5-minute periods to hourly values)
3. Create a Model
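The transformation and restructuring steps listed in phase 2 can be sketched with pandas. This is a minimal sketch with made-up 5-minute temperature readings; all values and names are illustrative:

```python
import pandas as pd

# Made-up 5-minute temperature readings in Fahrenheit (two hours of data).
index = pd.date_range('2022-12-01 00:00', periods=24, freq='5min')
fahrenheit = pd.Series([32.0 + i for i in range(24)], index=index)

# Harmonize temperature units: Fahrenheit to Celsius.
celsius = (fahrenheit - 32.0) * 5.0 / 9.0

# Downsample from 5-minute periods to hourly mean values.
hourly = celsius.resample('1h').mean()
print(hourly)
```

Restricting to a subset of the data (for instance, only the past 30 days) would be a simple slice on the time index of such a series.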
Artificial intelligence is to some extent a buzzword. It’s used for computer programs doing things we consider intelligent. Examples are image classification (what is shown in the image?), language processing (translate a text), and autonomous driving (orient and move in a complex environment). Under the hood there’s still a classical computer program, no intelligence.
Most, if not all, methods related to artificial intelligence are based on processing large data sets. Image and language
processing systems are trained on large data sets of sample images and sample texts. Autonomous driving uses
reinforcement learning, which can be understood as collecting large amounts of data while exploiting information
extracted from previously collected data (data collection on demand). In this sense, artificial intelligence is a subfield
of data science. In this book we also cover this vague field of artificial intelligence, including reinforcement learning.
There’s also a strict mathematical definition of artificial intelligence: A computer system is intelligent if it passes the
Turing test. In the Turing test a human chats with another human and the computer system in parallel. If the human
cannot decide which of both chat partners is human, the computer passed the test. Up to now no computer passed
the Turing tests. If interested, have a look at Wikipedia’s article on the Turing test26 .
By machine learning we denote the process of writing computer programs that ‘learn’ to do something from data. In
other words, we set up a model with lots of unknowns and then fit the model to our data. So machine learning refers
to a style of software development. We do not write a program line by line. Instead we use a general purpose program
and fill in the details automatically based on some data set.
Machine learning may be regarded as the hard core of data science and artificial intelligence, the part which contains
all the mathematics.
26 https://en.wikipedia.org/wiki/Turing_test
27 https://xkcd.com/1838
Fig. 3.1: The pile gets soaked with data and starts to get mushy over time, so it’s technically recurrent. Source:
Randall Munroe, xkcd.com/183827
FOUR
In this book we use the Python28 programming language for talking to the computer. Tools from the Jupyter29
ecosystem allow for Python programming in a very comfortable graphical environment.
There are lots of software tools for data science and artificial intelligence. They can be divided into two groups:
Tailor-made GUI tools
For common tasks in data science and AI, like clustering data or classifying images, there exist (mostly commercial)
tools with a graphical user interface (GUI). Such tools are easy to use, but they have a very limited scope of application.
Each task requires a different tool. Available methods are restricted to well-known ones. Implementing new problem-specific
methods is not possible.
General Purpose Tools
To enjoy maximum freedom in choice of methods one has to leave the world of GUI tools. Creating data science
models (that is, computer programs) without any restrictions requires the use of some high-level programming lan-
guage. Examples are R30 and Python31 . Both languages are very common in the data science community because
they ship with lots of extensions for simple usage in data science and AI.
Tailor-made tools come and go as time moves on. Programming languages are much more long-lasting. In this book
we stick to the Python programming language and its ecosystem. The R programming language would be a good
alternative, but it leans more toward statistical tasks than toward general purpose programming.
Tip: Some people feel frightened if someone says ‘programming language’. Think of programming languages as
usual software tools. The only difference is that they provide much more functionality than GUI tools. But there’s
not enough space on screen to have a button for each function. So we write text commands.
Python is a modern, free and open source programming language. It dates back to the early 1990s with a first official
release in 1994. Its father and BDFL (benevolent dictator for life) is Guido van Rossum32 .
Python code is very readable and straightforward, without too many cumbersome symbols like in most other pro-
gramming languages. Many technical aspects of computer programming are managed by Python instead of by the
programmer. With Python one may develop the full range of software, from simple scripts to fully featured web or
desktop applications. Thousands of extensions allow for rapid development.
28 https://www.python.org
29 https://jupyter.org
30 https://www.r-project.org/
31 https://www.python.org
32 https://en.wikipedia.org/wiki/Guido_van_Rossum
33 https://xkcd.com/353
Data Science and Artificial Intelligence for Undergraduates
Fig. 4.1: I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I’m leaving you. Source: Randall
Munroe, xkcd.com/35333
There’s a large online community discussing Python topics. Almost every problem you’ll encounter has already been
solved. Simply use a search engine to find the answer.
Fig. 4.2: Popularity of programming languages on Stack Overflow34 . Source: Stack Overflow Trends35 (modified
by the author)
Some rules followed by Python and its community are collected in the Zen of Python36 . Here are some of them:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.
• There should be one – and preferably only one – obvious way to do it.
• Although that way may not be obvious at first unless you’re Dutch.
• If the implementation is hard to explain, it’s a bad idea.
• If the implementation is easy to explain, it may be a good idea.
Last but not least, Python is available on all platforms: Linux, macOS, Windows, and many more. Large parts of
YouTube are written in Python and many other tech giants use Python. But it’s also not unlikely that a Python script
controls your washing machine.
Hint: There are two versions of Python: Python 2 and Python 3. Source code is not compatible, that is, there are
programs written in Python 2 which cannot be executed by a Python 3 interpreter. In this book we stick to Python
3. Python 2 is considered deprecated since January 202037 .
34 https://stackoverflow.com
35 https://insights.stackoverflow.com/trends?tags=python%2Cjavascript%2Cjava%2Cc%23%2Cphp%2Cc%2B%2B%2Cr
36 https://en.wikipedia.org/wiki/Zen_of_Python
37 https://www.python.org/doc/sunset-python-2/
4.3 Jupyter
The Jupyter ecosystem is a collection of tools for Python programming with emphasis on data science. Jupyter allows
for Python programming in a web browser. Outputs, including complex and interactive visualizations, can be put right
below the code producing these outputs. Everything is in one document: code, outputs, text, images,…
Fig. 4.3: JupyterLab is the most widely used member of the Jupyter ecosystem. It brings Python to the web browser.
In this book you’ll meet at least four members of the Jupyter ecosystem.
JupyterLab38
JupyterLab is a web application bringing Python programming to the browser. It’s the everyday tool for data science.
JupyterLab may run on a remote server (cloud) or on your local machine.
Jupyter Notebook39
An alternative to JupyterLab is Jupyter Notebook. It’s a predecessor of JupyterLab and provides almost identical
functionality, but with a different look and feel.
JupyterHub40
Running JupyterLab in the cloud requires user authentication and user management. JupyterHub provides everything
we need to run several JupyterLabs on a server in parallel. Almost all JupyterLab providers (e.g., Gauss at Zwickau
University, Binder) rely on JupyterHub.
Jupyter Book41
This book is being published using Jupyter Book. Each page is a Jupyter notebook file. Jupyter Book provides
automatic generation of tables of contents, bibliography handling, and rendering to different output formats.
38 https://jupyter.org
39 https://jupyter.org
40 https://jupyter.org/hub
41 https://jupyterbook.org
Work through the following projects to get up and running with Python and Jupyter:
• Working with JupyterLab (page 285)
• Install Jupyter Locally (page 289)
• Python Without Jupyter (page 292)
FIVE
Computers are the main tool for data science and artificial intelligence. In this chapter we answer some basic questions:
• What is a computer?
• What are bits, bytes, kilobytes,…?
• What is software?
• What is programming and what are programming languages?
Although we don’t have to know detailed answers to these questions, we should have some rudimentary understanding
of what happens inside a computer.
Related exercises: Computer Basics (page 239).
Each modern computer consists of three components: central processing unit (CPU), memory, input/output (IO)
devices. These components are connected by many wires, which are organized together with some auxiliary stuff on
the computer’s mainboard.
Fig. 5.1: There’s a tight and fast data connection between CPU and memory. IO devices are connected to the CPU
and in some cases also directly to memory.
IO devices are all parts of the computer which provide an interface to humans, like screen, keyboard, printer, and
scanner. But mass storage devices (hard disk drives, SSDs, DVD drives, card readers and so on) are IO devices, too.
Another kind of IO device are network adapters for Ethernet, Wi-Fi, Bluetooth and others. The common feature of all IO
devices is that they produce and/or consume streams of binary data. ‘Binary’ means that there are only two different
values, usually denoted by 0 and 1. Electrically, 0 might stand for low voltage and 1 for high voltage.
Memory can store streams of binary data. In some sense it is similar to mass storage IO devices, but it is used in a
very different way. Most storage devices are very slow and access times for reading and writing data depend on the
position of the data on the device. In contrast, memory access is very fast and access times are independent of the
data’s concrete location. Whenever data has to be stored for a short time only, memory is used. Due to technological
reasons memory loses all data when power is turned off, whereas data on mass storage devices persists.
The CPU is a highly integrated circuit which processes streams of binary data. ‘Processing’ means that incoming data
from memory and/or IO devices is transformed and then sent to memory and/or IO devices. If a binary stream from
memory is interpreted as instructions by the CPU, then we say that the stream contains code. Data in the stricter
sense refers to parts of binary streams that are processed by the CPU, but which do not tell the CPU what to do.
Memory can contain code and data, whereas IO devices only produce non-code data.
A bit is a piece of binary information. It either holds a one or a zero. Less information than a bit is no information.
With a sequence of 𝑘 bits we can express 2ᵏ different values, for example the numbers 0, 1, …, 2ᵏ − 1.
Fig. 5.2: With 3 bits we may represent 8 different values. Each additional bit doubles the number of possible values.
By convention binary data in modern computers is organized in groups of 8 bits. A sequence of 8 bits is denoted as
a byte.
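The relation between 𝑘 bits and 2ᵏ values can be checked in Python by enumerating all bit patterns (a small sketch, not from the original text):

```python
import itertools

k = 3
# All possible sequences of k bits, each bit being 0 or 1.
patterns = list(itertools.product([0, 1], repeat=k))

print(len(patterns))  # 8, which equals 2 ** 3
```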
Following the metric system, there are kilobytes (1000 bytes), megabytes (1000 kilobytes), gigabytes (1000 megabytes),
and so on with prefixes tera, peta, exa, zetta, yotta. Corresponding symbols are kB or KB, MB, GB, TB, PB, EB, ZB,
YB.
In some hardware-oriented fields of computer science it is common practice to use the factor 1024 = 2¹⁰ instead of
1000. Thus, the size of a kilobyte may be 1000 or 1024 bytes. As a rule of thumb, 1000 is used for data transmission
and 1024 is used for memory and storage related things (except in ads for storage devices, because 1024 would
give a lower number of gigabytes). Sometimes the prefixes kibi, mebi, gibi, tebi, pebi, exbi, zebi, yobi are used with
corresponding symbols KiB, MiB, GiB, TiB, PiB, EiB, ZiB, YiB for factor 1024. One kibibyte, for instance, has
1024 bytes.
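Both conventions are easy to compare in Python (a small sketch; the ‘500 GB’ drive is an invented example):

```python
# Decimal prefixes: 1 kB = 1000 bytes. Binary prefixes: 1 KiB = 1024 bytes.
kB, KiB = 1000, 1024
GB, GiB = 1000 ** 3, 1024 ** 3

# A drive advertised as '500 GB' (decimal) holds fewer gibibytes:
print(500 * GB / GiB)  # roughly 465.66
```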
Important: Computers only work with binary data. Everything has to be represented as sequences of zeros and
ones. For integers, like 123, this is quite simple (see below). Rational numbers, like 0.123, may be represented by
two integers, a numerator 123 and a denominator 1000 for instance. But what about text data? Or images?
Data which cannot be represented as sequence of zeros and ones cannot be processed by a computer. We’ll
come back to this representation issue several times in this book.
Numbers may have a name, like one, two, three, four, five, six, seven, eight, nine, ten. There are even more named
numbers: eleven, twelve and zero, for instance. Obviously, not all numbers can have an individual name. We need a
system for automatically naming numbers and also for writing them down. At this point it is important to distinguish
between numbers, which can be used for counting and computations, and their representation in spoken and written
language.
In everyday life we use the decimal system based on 10 different digit symbols because we have 10 fingers. An
octopus surely would invent a numbering system with only 8 digits. The Maya civilization used a 20-digit system
(fingers plus toes). Computers would invent number systems based on 2 digits, because they are representable
by 1 bit, or on 4 digits (2 bits), 8 digits (3 bits), or 16 digits (4 bits).
42 https://en.wikipedia.org/wiki/Dresden_Codex
43 https://commons.wikimedia.org/wiki/File:Maya_Hieroglyphs_Plate_32.jpg
Fig. 5.3: Maya numerals on a page of a Maya book known as Dresden Codex42 . Source43 : Sylvanus Morley via
Wikimedia Commons, modified by the author.
There are many systems for writing down (and naming) numbers. Today the most widely used ones are positional.
An example of a non-positional system is Roman numerals.
Fix a number 𝑏, the basis, and take 𝑏 symbols to denote the first 𝑏 numbers. Here, we interpret zero as the first number,
followed by one as the second number and so on. In case 𝑏 is at most ten, we may use the symbols 0, 1, … , 9 for the
numbers from zero to nine. Every number 𝑐 has a unique representation of the form
𝑐 = 𝑎ₙ𝑏ⁿ + ⋯ + 𝑎₂𝑏² + 𝑎₁𝑏¹ + 𝑎₀𝑏⁰,
where 𝑎₀, 𝑎₁, …, 𝑎ₙ ∈ {0, 1, …, 𝑏 − 1} are the digits and 𝑛 + 1 is the number of digits required to express the number 𝑐
with respect to the basis 𝑏. With this unique representation at hand, we may write the number 𝑐 as a list of its digits:
𝑐 = 𝑎ₙ … 𝑎₀. Keep in mind that the basis 𝑏 has to be known to interpret a list of digits, although 𝑏 often is not written
down explicitly.
Example: If we take, for instance, the number twelve, then with ten as base 𝑏 we would have
twelve = 1 ⋅ 𝑏¹ + 2 ⋅ 𝑏⁰ = 12.
Numbers given in base ten are denoted as decimal numbers. More exactly, one should say ‘a number in decimal
representation’ since the number itself does not care about how we write it down.
To avoid confusion, each number we write down without explicitly specifying a basis is to be understood as a decimal
number. Numbers in a basis other than ten always will come with some hint on the basis.
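The digit expansion 𝑎ₙ … 𝑎₀ can be computed by repeated division by the basis 𝑏. A small sketch in Python (the helper function is invented for illustration):

```python
def digits(c, b):
    """Return the digits of c with respect to basis b, most significant first."""
    result = []
    while c > 0:
        c, digit = divmod(c, b)  # divmod returns quotient and remainder
        result.append(digit)
    return result[::-1] or [0]  # reverse; zero has the single digit 0

print(digits(12, 10))  # [1, 2]
print(digits(12, 2))   # [1, 1, 0, 0]
```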
Numbers in positional representation with base 2 are called binary numbers. They frequently appear in computer
engineering. Symbols for the two digits are 0, 1 and sometimes the letters O, I.
Example: Number twelve in binary representation is
12 = 1 ⋅ 2³ + 1 ⋅ 2² + 0 ⋅ 2¹ + 0 ⋅ 2⁰ = 1100 (binary).
Base 8 yields octal numbers. For octal numbers the usual digits 0 to 7 can be used.
Example:
12 = 1 ⋅ 8¹ + 4 ⋅ 8⁰ = 14 (octal).
Octal numbers occur, for instance, in file access permissions on Unix-like systems, because access is controlled by 3 sets
(owner, group, all) of 3 bits (read, write, execute). Thus, all possible combinations can be conveniently expressed by
three-digit octal numbers.
Example: Access right 750 (which is 111 101 000 in binary notation) says that the file’s owner may read, write and
execute the file. The owner’s group is not allowed to write to the file (only read and execute). All other users do not
have any access right.
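Python understands octal literals directly, which makes such permission values easy to inspect (a quick sketch, not from the original text):

```python
mode = 0o750  # octal literal: owner rwx, group r-x, others ---

print(bin(mode))           # 0b111101000, the 3 x 3 permission bits
print(mode & 0o200 != 0)   # may the owner write? True
print(mode & 0o020 != 0)   # may the group write? False
```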
Base 16 yields hexadecimal numbers. For hexadecimal numbers we use 0 to 9 followed by the symbols a, b, c, d, e,
f to denote the digits.
Examples:
12 = 12 ⋅ 16⁰ = c (hexadecimal),
125 = 7 ⋅ 16¹ + 13 ⋅ 16⁰ = 7d (hexadecimal).
Note that letters a to f might be digits of a hexadecimal number as well as variable names. Have a look at the context
to get the correct meaning. Sometimes capital letters A, B, C, D, E, F are used.
Hexadecimal numbers occur in many different situations because the range 0 to 255 of a byte value maps exactly to
the set of all two-digit hexadecimal numbers: 00 to ff. We will meet this notation when specifying colors.
Example: The color value ff c0 60 yields a light orange (100% red intensity, 75% green intensity, 38% blue intensity).
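The mapping between hexadecimal byte values and intensities can be verified with Python’s int function, which accepts a basis as its second argument (a small sketch):

```python
# The three bytes of the color value ff c0 60, converted from hexadecimal.
red, green, blue = int("ff", 16), int("c0", 16), int("60", 16)

print(red, green, blue)  # 255 192 96
# Intensities relative to the maximum byte value 255:
print(round(red / 255 * 100), round(green / 255 * 100), round(blue / 255 * 100))
```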
Fig. 5.4: Professional graphics programs show hexadecimal color values, often denoted as ‘HTML notation’, because
hexadecimal color values frequently occur in HTML44 code for websites.
Software is a stream of binary data to be read and processed by the CPU. The task of a software developer is to
generate streams of binary data which make the CPU do what the software developer wants it to do.
Hint: ‘Binary data’ has at least two different meanings, depending on the context.
• In programming contexts, where we have to distinguish between computer and human readable data, data is
considered ‘binary’ if it has no useful interpretation as text.
• In more general contexts, data is considered ‘binary’ if it is or can be represented as a sequence of zeros and
ones. In this sense, a picture is not binary data, but a digital copy consisting of pixels instead of brushstrokes
is binary data.
Modern software has a size of several megabytes or even gigabytes. It is practically impossible for humans to generate
such large and complex amounts of binary data by hand. Instead, the process of software development has been auto-
mated step by step, beginning from scratch with directly coding zeros and ones in the 1950s up to today’s high-level
programming languages.
44 https://en.wikipedia.org/wiki/HTML
5.4.1 Assemblers
A first step of automation was the invention of assemblers. These are computer programs which transform a
set of, to some extent, human readable codewords into a sequence of zeros and ones processable by a CPU. Here is an
example:
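The original listing is not reproduced here; a hypothetical x86-style reconstruction matching the description below could look like this (register names and mnemonics are assumptions, not the book’s original code):

```asm
mov eax, [120]   ; read 4 bytes from memory address 120 into register eax
mov ebx, [124]   ; read 4 bytes from memory address 124 into register ebx
add eax, ebx     ; add both values, result stays in eax
mov [128], eax   ; write the result to memory address 128
```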
The first line tells the CPU to read 4 bytes from memory address 120 and to store them in one of its registers (a kind
of internal memory). The second line does the same, but with memory address 124 and a second CPU register. Then the
CPU is told to add both values. The CPU stores the result in its eax register. The last line makes the CPU write the
result of the addition to memory address 128.
Writing computer programs in assembler code made software development much easier. But due to the very limited
instruction set, reflecting one-to-one the instruction set of the CPU, programs are hard to read and tightly bound
to the hardware they were designed for. The only advantages of assembler code compared to modern programming
languages are its speed of execution and its small size after transformation to binary code. The first initialization routine
of modern operating systems is still written in assembler code, because it has to fit into a small predefined portion of
a storage device called the boot sector.
A further step in the evolution of programming languages are languages for structured programming. Examples are C,
BASIC, and Pascal. Here the hardware is almost completely abstracted away, and a relatively complex program, the compiler,
is needed to transform the source code written by the software developer into binary code for the CPU. Here is a
snippet of a C program:
int a, b;
a = 5;
b = 10 * a + 7;
printf("result is %i", b);
The first line tells the compiler that we need two places in memory for storing integer values. The second line makes
the CPU move the value 5 to the place in memory referenced by a. The third line makes the CPU do some calculations
and store the result in memory referenced by b. Finally, the result shall be printed on screen. Writing this in assembler
code would require some hundred lines of code and we would have to take care of memory organization (where is
free space?) and of the instruction set of the CPU. Both are done by the compiler. The C language in particular is still of
great importance. It is used, for example, for large parts of Linux and Windows.
A further layer of abstraction is object-oriented programming. Instead of handling hundreds of variables and hundreds
of functions for their processing, everything is organized in a well structured way reflecting the structure of the real
world. Examples of programming languages allowing for object-oriented programming are C++, Java, Python.
Source code of a computer program either is compiled or interpreted. Compiling means that the source code is
translated to binary code and after finishing this translation it can be executed, that is, fed to the CPU. Interpretation
means that the source code is translated line by line and each translated line is sent immediately to the processor.
Compiled programs run much faster than interpreted ones. But interpreted programs allow for simpler debugging
and more intuitive elements in the programming language. Sometimes interpreted programs are called scripts and
corresponding languages are denoted as scripting languages. C and C++ are compiled languages whereas Python is
interpreted. Java is somewhere in between.
CHAPTER
SIX
CRASH COURSE
This chapter provides an overview of everyday Python features and their basic usage.
• A First Python Program (page 43)
• Building Blocks (page 46)
• Screen IO (page 56)
• Library Code (page 57)
• Everything is an Object (page 59)
Related exercises:
• Finding Errors (page 243)
• Basics (page 249)
• Simple List Algorithms (page 297)
• More Basics (page 251)
• Geometric Objects (page 300)
We start the Python crash course with a small program. The program will ask the user for some keyboard input. If
the user types bye the program stops, else it asks again.
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:")
    code = input()  # wait for user input
    if code == "":
        print("To lazy to type?")
        print("")
print("Bye")
The first line
code = None
provides space in memory to store something (this is quite imprecise, but will be made precise soon). We name this
piece of memory code since we want to store a code typed by the user. At the moment there is nothing to store,
which is expressed by None.
The second line
while code != "bye":
is the head of a loop. The subsequent indented code block is executed again and again as long as the condition code
!= "bye" is satisfied. Here, != means unequal. Thus the line
print("Bye")
is not executed before the variable code holds the value bye.
The line
code = input()  # wait for user input
waits for user input, which then is stored in the variable code. Text following a # symbol is ignored by the Python
interpreter.
The lines
if code == "":
    print("To lazy to type?")
    print("")
check whether the user entered an empty string (that is, simply hit the enter key) and, if so, print a reminder followed
by an empty line.
The line
print("Bye")
is only executed if the user types bye. Then, after executing this, the program stops. There is no other way for the
user to stop the program (apart from killing it with the operating system’s help).
• Contrary to most other programming languages, code indentation matters. Subblocks of code have to be
indented to be recognized as such by the Python interpreter. Usual indentation width is 4 spaces, but other
widths and also tabs may be used as long as indentation is consistent throughout the file.
• There is something we call a function. Functions in our script are print and input. We can pass arguments
to functions to influence their behavior (what to print?) and functions may return some value. The latter is the
case for the input function. The return value is stored in the variable code. Note that even if we don’t want
to pass arguments to a function, we have to write parentheses: input().
6.1.4 Errors
The Python interpreter detects syntax errors, that is, code not following Python’s grammar. If, for instance, we forget
a closing parenthesis, we get an error message:
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:"
    code = input()  # wait for user input
    if code == "":
        print("To lazy to type?")
        print("")
print("Bye")
Input In [2]
    print("I'm a Python program. Type 'bye' to stop me:"
         ^
SyntaxError: '(' was never closed
The next code example contains a semantic error not detectable by the interpreter: instead of == there is != in the if
statement, which checks for inequality instead of equality. Thus, the program will print the To lazy to type?
message if the user has typed something. No message will appear if the user has provided no input. That’s not what
we want the program to do.
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:")
    code = input()  # wait for user input
    if code != "":
        print("To lazy to type?")
        print("")
print("Bye")
Important: Writing code is not too hard. But writing code without errors is almost impossible. Finding and
correcting errors is an essential task and will consume considerable resources (time and nerves). Some advice for
finding errors:
• Always read and understand the error message (if there is one).
• If you do not understand an error message, ask a search engine for explanation.
• Line numbers shown in an error message often are correct, but sometimes the erroneous code is located above
(not below) the indicated position.
• Test run your code as often as possible. Write some lines, then test these lines. The less code you write between
tests, the more obvious the reason for an error.
• For each error there is a solution. You just have to find it.
• The Python interpreter is always right! If it says that there is an error, then THERE IS AN ERROR.
We start our quick run through Python with essential features which can be found in almost every high-level program-
ming language. What we will meet here is known as structured programming. Later on we will move on to object
oriented programming.
6.2.1 Comments
A Python source code file may contain text ignored by the Python interpreter. Such comments help to understand and
document the source code. Everything following a # symbol is ignored by the interpreter.
# this line is ignored by the interpreter
a = 1  # everything after the # symbol is ignored, too
b = 2
Note that empty lines do not matter. Like comments, empty lines may be placed anywhere to make the code more
readable.
6.2.2 Assignments
Data (numbers, strings, and other) in memory can be associated with a human readable string. Such a combination of a
piece of data and a name for it is known as a variable. To assign a name to a piece of data Python uses the = sign.
a = 1
The above code writes the number 1 to some location in memory and assigns the name a to it. Whenever we want
to use or modify this value we simply have to provide its name a. The Python interpreter translates the name into a
memory address.
print(a)
1
We have to distinguish different types of data because each type comes with its own set of operations. Numbers can
be added and multiplied, for example, whereas strings can be concatenated but not multiplied. In Python we do not
have to care too much about choosing the correct data type, because the interpreter does much of the technical stuff
(e.g., how much memory is required?) for us.
a = 2 # an integer
b = 2.1 # a floating point number
c = "Hello!" # a string
d = True # a boolean value
print(a)
print(b)
print(c)
print(d)
2
2.1
Hello!
True
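The type of a value can be inspected with the built-in type function; a quick check of the four types from the listing above (added for illustration):

```python
print(type(2))        # <class 'int'>
print(type(2.1))      # <class 'float'>
print(type("Hello"))  # <class 'str'>
print(type(True))     # <class 'bool'>
```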
Integers
Note: In most programming languages there is a maximum value an integer can attain; a 32-bit integer, for instance,
is restricted to the range −2³¹, …, 2³¹ − 1. In Python there is no limit on the size of an integer.
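This can be checked directly, for instance by computing a power far beyond any 64-bit range (a quick check, added for illustration):

```python
# 2 to the power of 100 exceeds every fixed-size integer type, but works fine:
big = 2 ** 100

print(big)      # 1267650600228229401496703205376
print(big + 1)  # exact arithmetic, no overflow
```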
a = 5 + 2   # addition
b = 5 - 2   # subtraction
c = 5 * 2   # multiplication
d = 5 // 2  # floor division
e = 5 % 2   # remainder of division
f = 5 / 2   # division (yields a floating point number)
g = 2 ** 5  # power

print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
print(g)
7
3
10
2
1
2.5
32
a = 1 // 0
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 a = 1 // 0
If we want the user to input an integer, we may use the following code:
a = int(input())
Like print and input, int is a function. It takes a string and converts it to an integer. If this is not possible,
an error message appears and program execution is stopped.
Python supports floating point numbers (also known as floats) in the approximate range 1e-308…1e+308 with 15
significant decimal places (double precision in IEEE 754 standard45 ). Floating point numbers are stored as a pair of
coefficient and exponent of 2, where both coefficient and exponent are integers.
Example: 0.1875 = 3 ⋅ 2⁻⁴ with coefficient 3 and exponent −4.
Important: Most decimal fractions cannot be represented exactly as float, which may cause tiny errors in compu-
tations.
Example:
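The original example listing is not reproduced here; a common minimal demonstration of such tiny errors (an assumption-based illustration, not the book’s original code):

```python
import math

print(0.1 + 0.2)         # 0.30000000000000004, not exactly 0.3
print(0.1 + 0.2 == 0.3)  # False

# Comparisons of floats should therefore allow for tiny errors:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```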
See Python documentation46 for more detailed explanation and additional examples.
45 https://en.wikipedia.org/wiki/IEEE_754
46 https://docs.python.org/3/tutorial/floatingpoint.html
print(a)
print(b)
print(c)
print(d)
5
5.0
5.123
7.123
Note that Python converts data types automatically as needed. The destination type is chosen to prevent loss of data as
far as possible (cf. line 4 in the code example above). If conversion is not possible, the interpreter will complain.
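A small illustration of automatic conversion (added for illustration, not the book’s original listing):

```python
# Adding an integer and a float yields a float, so no information is lost:
a = 5 + 2.5
print(a)        # 7.5
print(type(a))  # <class 'float'>

# Explicit conversion is possible, too, but may lose information:
print(int(7.9))  # 7, the fractional part is cut off
```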
Strings
In Python strings are as simple as numbers. Just enclose some characters in single or double quotation marks and
they will become a Python string.
print(d)
Hello my friend!
Behavior of operators like + depends on the data type of the operands. Adding two integers 123 + 456 yields the
integer 579. Adding two strings '123' + '456' yields the string '123456'.
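A quick check of this behavior (added for illustration):

```python
print(123 + 456)      # 579, integer addition
print("123" + "456")  # 123456, string concatenation

# Mixing both types in + is an error; convert explicitly instead:
print(int("456") + 123)  # 579
```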
If a string contains single quotation marks, then use double quotation marks and vice versa. Alternatively, you may
escape quotation marks in a string with a backslash.
a = "He isn't cool."
b = 'He isn\'t cool.'
c = 'He said: "You are crazy"'
d = "He said: \"You are crazy\""

print(a)
print(b)
print(c)
print(d)
He isn't cool.
He isn't cool.
He said: "You are crazy"
He said: "You are crazy"
Boolean Values
Boolean values or truth values can hold either True or False. Typically, they are the result of comparisons. Boolean
values support logical operations like and, or, and not (see Logic (page 329)).
a = True
b = a and False
c = not a
d = a or b
print(a)
print(b)
print(c)
print(d)
True
False
False
True
6.2.4 Functions
A function is a piece of Python code with a name. To execute the code we have to write its name, optionally followed
by parameters (sometimes denoted as arguments) influencing the function’s code execution. After executing the
function some value can be returned to the caller.
This concept is required in two circumstances:
• a piece of code is needed several times,
• readability shall be increased by hiding some code.
Built-in Functions
Python has several built-in functions, like print and input. The print function takes one or more variables and
prints them on screen. In case of multiple arguments outputs are separated by spaces. The input function may be
called without arguments. It waits for user input and returns the input to the calling code.
a = input()
print('You typed:', a)
Above we also met the int function, which converts a string to an integer if possible. The int function behaves
exactly in the same way as all other functions, but it is not a built-in function in the stricter sense. Instead, it’s the
constructor of a class, a concept we’ll discuss later on.
Keyword Arguments
Functions accept different kinds of arguments. Some are passed as they are (like for print). Those are called
positional arguments and we meet them in almost all programming languages.
In Python often we’ll see function calls of the form some_function(argument_name=passed_value).
Such arguments are called keyword arguments and help to increase code readability. If a function accepts multiple
keyword arguments, we do not have to care about which one to pass first, second and so on. Details will be discussed
in a separate chapter on functions later on.
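The built-in sorted function illustrates this: key and reverse are passed as keyword arguments (a small sketch, added for illustration):

```python
words = ["banana", "fig", "apple"]

# Positional argument 'words', keyword arguments 'key' and 'reverse':
# sort by word length, longest first.
print(sorted(words, key=len, reverse=True))  # ['banana', 'apple', 'fig']
```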
Function Definitions
def say_hello(name):
    ''' Print a hello message. '''
    message = 'Hello ' + name + '!'
    print(message)

say_hello('John')
say_hello('Anna')
Hello John!
Hello Anna!
Note the indentation of the function’s code and the docstring '''...'''. The indentation tells the Python
interpreter which lines of code belong to the function. The docstring is ignored by the interpreter like a comment. But
tools for automatic generation of software documentation extract the docstring and process it.
To return a value (like input does) we would have to add a line containing return my_value. The
return keyword stops execution of the function and returns control to the calling code. We place return wherever
appropriate for our purposes. Often, but not always, it’s in the last line of the function’s code.
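A minimal sketch of a function using return (the function name and values are made up for illustration):

```python
def make_greeting(name):
    ''' Return (rather than print) a hello message. '''
    return 'Hello ' + name + '!'

text = make_greeting('John')   # the returned string is tied to the name text
print(text)
```

Here nothing is printed inside the function; the caller decides what to do with the returned value.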
Important: Variables introduced inside a function, like message above, are only accessible inside that function.
But variables defined outside a function are accessible inside functions, too. It’s considered good practice to keep
inside and outside variables separated. That is, don’t use outside variables inside a function. Instead pass all values
required by the function as arguments and return results required outside a function with return. Exceptions prove
the rule.
Errors in Functions
If there is an error in a function’s code, the Python interpreter will show an error message together with a traceback.
That’s a list of code lines leading to the erroneous line. If a program calls a function which again calls a function
which contains an error, the traceback will have three entries.
In the following example the variable name is incorrect in the print line.
def say_hello(name):
    ''' Print a hello message. '''
    message = 'Hello ' + name + '!'
    print(mesage)

say_hello('John')
say_hello('Anna')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [15], in <cell line: 8>()
4 message = 'Hello ' + name + '!'
5 print(mesage)
----> 8 say_hello('John')
9 say_hello('Anna')
If there were an error in print (because we passed an unexpected argument, for instance), then the traceback
would have an additional entry showing the erroneous line in the definition of print.
Hint: Tracebacks may become very long if your code triggers a problem deep inside some built-in or library function. Check
the traceback carefully for the last entry referring to your own code. That’s the most likely location of the problem’s
cause.
Up to now program flow is linear. There is one path and the interpreter will follow this path. Here comes the first
element of flow control: conditional execution.
if a > b:
    print('First number is greater.')
else:
    print('First number is not greater.')
If the condition is satisfied, then the first code block is executed. If it is not satisfied, the else block is executed.
For equality use ==, for inequality use !=. Other comparison operators are <, >, <=, >=.
A comparison evaluates to a boolean value. Thus, more complex conditions can be constructed with the help of
boolean operators.
if a < 0:
    print('It\'s a negative number.')
elif a == 0:
    print('It\'s zero.')
elif a < 10:
    print('It\'s a small positive number.')
else:
    print('It\'s a large positive number.')
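Conditions combining several comparisons via boolean operators might look like this (the value of a is made up):

```python
a = 7

# 'and' requires both comparisons to be satisfied
if a > 0 and a % 2 == 1:
    print('positive and odd')

# 'not' negates a condition
if not (a < 0):
    print('not negative')
```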
The second element of flow control, next to conditional execution, is repeated execution. Python provides two
techniques: while loops and for loops.
For Loops
for number in range(1, 10):
    print(number ** 2)
1
4
9
16
25
36
49
64
81
Note that 100 is not printed. The loop always stops before the final number is reached.
Note: Whenever you have to define a range of integers in Python, the upper bound has to be the last value you need
plus 1. If you come from some other programming language, this peculiarity of Python takes some getting used to.
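The exclusive upper bound can be sketched as follows:

```python
# range(3, 7) yields 3, 4, 5, 6 -- the upper bound 7 is excluded
for n in range(3, 7):
    print(n)
```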
While Loops
my_number = 10

guess = int(input('Guess my number: '))
while guess != my_number:
    guess = int(input('Wrong guess. Try again: '))

print('Correct!')
No Do-While Loops
Many programming languages have a so-called do-while loop. That’s like a while loop, but the condition is checked
at the loop’s end. Thus, the loop’s code block is executed at least once. Python does not have a do-while loop.
Guido van Rossum, Python’s BDFL47 , rejected a Python enhancement proposal (PEP) which suggested to introduce
a do-while loop48 with the following words49 :
47 https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
48 https://peps.python.org/pep-0315/
49 https://mail.python.org/pipermail/python-ideas/2013-June/021610.html
Please reject the PEP. More variations along these lines won’t make the language more elegant or easier
to learn. They’d just save a few hasty folks some typing while making others who have to read/maintain
their code wonder what it means.
For and while loops provide the keywords break and continue. With break we can abort execution of the
loop. With continue we can stop execution of the loop’s code block and immediately begin the next iteration.
Loops may have an else code block. The else block is executed if iteration terminates regularly. It is skipped, if
iteration is stopped by break.
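A sketch combining break with a loop’s else block (the list of numbers is made up):

```python
numbers = [3, 8, -2, 7]

# search for the first negative number
for n in numbers:
    if n < 0:
        print('Found negative number:', n)
        break
else:
    # runs only if the loop finished without break
    print('No negative number found.')
```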
Note: Whereas for and while loops are available in almost all programming languages, the else block is a special
feature of Python.
6.2.7 Lists
Next to the simple data types above there are more complex ones. Here we restrict our attention to lists. A list can
hold a number of values. The length of a list is returned by the built-in function len. Square brackets [ and ] are
used for defining a list and for accessing single elements of a list.
a = [3, 7, 'hello', 0.1]

print('List:', a)
print('Last element:', a[len(a) - 1])
List: [3, 7, 'hello', 0.1]
Last element: 0.1
List indices start with 0 in Python. Consequently, the last element of an n-element list has index n-1. The above code
to access the last element is considered non-pythonic. Why this is the case and how to make it better will be discussed
later on.
Lists may contain arbitrary types of data. Even lists of lists are allowed. This way we can construct two-dimensional
data structures.
a = [[3, 4, 5], [1, 2], [4, 6]]

print(a[0])
print(a[0][0], a[2][0])
[3, 4, 5]
3 4
a = [3, 4, 5]
print(a)
a[1] = 1000
print(a)
[3, 4, 5]
[3, 1000, 5]
How to append elements to an existing list and many more list related topics will be discussed later on.
Note: A list may have length 0, that is, it may be empty. Empty lists occur frequently because often one wants to
fill lists item by item, starting with an empty list. To get an empty list in Python write [].
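A minimal sketch:

```python
a = []             # an empty list
print('Length:', len(a))
```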
With the above building blocks at hand we may write arbitrarily complex programs. There is nothing more we need.
It’s like Lego blocks50 . Take lots of simple blocks, add some creativity, and think about how to reach your aim step
by step.
All the other features of Python we’ll discuss soon only exist to simplify programming, save some time and make
programs more readable. But they won’t add new possibilities.
To be precise: there is a small number of additional built-in functions we need to know, like open for accessing
files.
Building everything from scratch is a long and winding road. So we’ll use other people’s code and combine it to
new and larger projects. There’s a large library of ready-to-use code snippets, called the Python standard library51 .
For specific tasks like data science and AI there are specialized libraries containing thousands of functions we may use
without implementing them ourselves. Examples are Matplotlib52 , Pandas53 , Scikit-Learn54 and TensorFlow55 .
50 https://en.wikipedia.org/wiki/Lego
51 https://docs.python.org/3/library/
52 https://matplotlib.org/
53 https://pandas.pydata.org/
54 https://scikit-learn.org/stable/
55 https://www.tensorflow.org/
6.3 Screen IO
Input from keyboard and output to screen are important for a program’s interaction with the user. Here we discuss
some frequently used features of terminal and Jupyter based screen IO. Graphical user interfaces (GUIs) are possible
in Python, too, but won’t be discussed here.
6.3.1 Input
The input function does not provide more functionality than we have already seen. It takes one argument, the text
to be printed on screen, and returns a string with the user’s input.
6.3.2 Output
The print function provides some fine-tuning. We already saw that it takes an arbitrary number of arguments.
Each argument which is not a keyword argument will be sent to screen. Outputs of multiple arguments are separated
by one space. Non-string arguments are converted to strings automatically.
Instead of spaces, different separators can be specified with the keyword argument sep='...'.
a = 3
b = 2.34
c = 'some_string'
print(a, b, c, sep=' | ')
3 | 2.34 | some_string
Note: Note the difference to print(a, b, c, ' | '), which yields 3 2.34 some_string | .
The print function automatically adds a line break to the output. If this is not desired we may pass something else
as keyword argument end='...', an empty string for instance.
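A sketch of the end keyword argument:

```python
print('no', end=' ')    # end=' ' replaces the automatic line break by a space
print('line break')     # continues on the same line
```

The two calls together print no line break on a single line.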
Note: The \ character is not printed but interpreted as an escape character. The character following \ is a
command to the output routine. Next to \n we already met \' and \". If you have to print a \ on screen use \\.
Calling print without any arguments prints a line break. Thus, print() and print('') are equivalent.
In JupyterLab and also in the Python interpreter’s interactive mode we do not have to write print(...) every time
we want to see some result. JupyterLab automatically prints the value of the last line in a code cell. The Python
interpreter automatically prints the value of the last command issued. Instead of
print(123 * 456)
56088
we may simply write
123 * 456
56088
While plain Python calls print on the value to output, JupyterLab calls its own function display. For simple data
like numbers and strings display does the same as print, but for complex data like tables or images display
produces richer output. Even audio files and videos may be embedded into Jupyter notebooks by calling display
on suitably prepared data (or leaving this to JupyterLab if the data is produced in the last line of a cell).
Source code libraries contain reusable code. In Python reusing code written by other people is very simple and there
are lots of code libraries available for free. Code libraries for Python are organized in modules and packages.
Next to built-in functions like print and input Python ships with several modules, which can be loaded on demand.
A module is a collection of additional functionality. Everybody can write and publish Python modules. How to do this
will be explained later on. Modules are written either in Python or in some other language, mainly the C programming
language.
Before we can use functionality of a module we have to import it:
import numpy as np
Hint: A number of modules come pre-installed with Python (the Python standard library56 ). But many others have
to be installed separately. Whenever Python shows a ModuleNotFoundError, the module you want to import is
not installed. Install a module via Anaconda Navigator or with conda install module_name in a terminal;
see the Install Jupyter Locally (page 289) project for more details.
The code above imports the module numpy and makes it accessible under the name np, which is shorter than
‘numpy’. NumPy57 is a collection of functions and data types for advanced numerical computations. We will dive
deeper into this module later on. To use NumPy’s functionality we have to write np.some_function with
some_function replaced by one of NumPy’s functions.
np.sin(0.25 * np.pi)
56 https://docs.python.org/3/library/
57 https://numpy.org
0.7071067811865475
Here, np.pi is a floating point variable holding an approximation of 𝜋. The function np.sin computes the sine
of its argument.
Note: The name np for accessing functionality of the module numpy can be chosen freely. But for widely used
modules like NumPy there are standard names everybody should use to improve code readability. These names are
given in a module’s documentation (look at the code examples there). Keep in mind: that’s a convention;
import numpy as wild_cat would be okay, too, from a technical point of view.
A package is a collection of modules. A module from a package can be imported via import package.module.
A very important Python package is Matplotlib58 for scientific plotting:
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [2, 4, 3, 5])
plt.show()
With plt.plot we create a line plot and plt.show displays this plot. If the code runs in JupyterLab the plot is
embedded into the notebook. If run by a plain Python interpreter a window opens showing the plot.
We will come back to Matplotlib when discussing data visualization.
58 https://matplotlib.org/
The Python programming language follows some basic design principles, which make the language very clear and in
some sense aesthetic. One such principle is ‘Everything is an object’.
A Python object is a collection of variables and functions. We may use objects to bring structure into a program’s
variables and functions. Each object is of a certain type. An object’s type specifies a minimum list of variables and
functions an object contains or provides. Variables and functions belonging to an object are called member variables
and member functions. A synonym for member function is method.
Large programs have lots of variables and functions. By object-oriented programming we denote an approach to
structure the morass of variables and functions into objects. Python supports this programming style.
For the moment we look at objects as containers for structuring source code. Later on we will discuss more advanced
features of object-oriented programming like defining hierarchies of types (to untangle the morass of types…).
Example
Think of a Python program which does some geometrical computations and then plots a line and a circle. To specify
geometric properties we could use variables
• line_start_x,
• line_start_y,
• line_end_x,
• line_end_y,
• circle_x,
• circle_y,
• circle_radius.
In addition, we could have functions
• draw_line,
• draw_circle,
• move_line,
• move_circle,
• rotate_line,
and so on.
Utilizing the idea of objects we could introduce objects of type Point with member variables x and y as well as an
object of type Line with member variables start and end both of type Point. Analogously, an object of type
Circle with member variables center (of type Point) and radius would be nice. Both the Line object and
the Circle object could have member functions draw and move.
The object hierarchy could look like this:
• my_line
– start
∗ x
∗ y
– end
∗ x
∗ y
– draw()
– move(...)
– rotate(...)
• my_circle
– center
∗ x
∗ y
– radius
– draw()
– move(...)
Objects Everywhere
Surprisingly, in Python there are no values which are not objects. Even integers are objects. Most other object-
oriented programming languages have some fundamental data types (integers, floats, characters) that are not
represented as objects in the sense of object-oriented programming. In Python there are no such fundamental types!
To get a list of all members (variables and functions) of an object, Python has the built-in function dir.
a = 2
dir(a)
['__abs__',
'__add__',
'__and__',
'__bool__',
'__ceil__',
'__class__',
'__delattr__',
'__dir__',
'__divmod__',
'__doc__',
'__eq__',
'__float__',
'__floor__',
'__floordiv__',
'__format__',
'__ge__',
'__getattribute__',
'__getnewargs__',
'__gt__',
'__hash__',
'__index__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
...]
Note: The dir function returns a list of strings. Since it’s the last line of a cell, this list is printed to screen by
JupyterLab automatically.
Most of all these members won’t be used directly. The __add__ function, for instance, is called by the Python
interpreter whenever we add integers. The following line of code is equivalent to a + 3:
a.__add__(3)
Each object has a type. Objects of identical type provide identical functionality. An object’s type is returned by
Python’s built-in function type.
type(a)
int
Note: As for dir above, the type function does not print anything. Screen output is done by JupyterLab
automatically here. Note that Python’s print yields <class 'int'> here whereas JupyterLab’s display produces
int to visualize type’s return value.
Note that the dot syntax object.member already appeared when accessing modules: module.function. This
is not by chance. Everything is an object in Python!
import numpy as np
type(np)
module
Importing a module leads to a new object of type module providing all the functionality of the imported module in
form of member functions and variables.
import numpy as np
dir(np)
['ALLOW_THREADS',
'AxisError',
'BUFSIZE',
'CLIP',
'ComplexWarning',
'DataSource',
'ERR_CALL',
'ERR_DEFAULT',
'ERR_IGNORE',
'ERR_LOG',
'ERR_PRINT',
'ERR_RAISE',
'ERR_WARN',
'FLOATING_POINT_SUPPORT',
'FPE_DIVIDEBYZERO',
'FPE_INVALID',
'FPE_OVERFLOW',
'FPE_UNDERFLOW',
'False_',
'Inf',
'Infinity',
...]
def my_function():
    print('Here we could do something useful.')
type(my_function)
function
type(type(my_function))
type
type(print)
builtin_function_or_method
We may define new objects with the class keyword. Instead of ‘object type’ one often says ‘class’. To create a class
for describing geometric points we could write
class Point:

    def __init__(self, x, y):
        self.x = x
        self.y = y
This defines a new object type or class Point. Here __init__ is the only member function. The function with
this special name is implicitly called by Python whenever a new Point object has been created. This initialization
function expects the coordinates of the new point and stores them internally as member variables. The self argument
provides access to the newly created object.
A class is like a blueprint. But the methods defined in the blueprint are called for concrete objects. The self
argument gives access to the object for which a method has been called. Remember: there is only one class (blueprint),
but there may be many objects of this type.
Note: From a technical point of view instead of self we may use any name for the first argument (bad practice!).
But the first argument of a method always is the object for which the method has been called.
Note: Methods with two leading and two trailing underscores are called magic methods or dunder methods (dunder
= double underscore). They are called by the Python interpreter for special purposes. We already met another
dunder method: __add__, which is called whenever the + operator is used on an object. In contrast to most other
programming languages virtually every operation in Python can be customized by implementing a suitable dunder
method.
If we write
center = Point(3, 7)
the Python interpreter creates a new object of type Point. At this moment the new object has only one method,
__init__ and no member variables. Immediately after creation Python calls the object’s __init__ function.
The arguments passed to __init__ are the newly created object, 3 and 7.
Now we can work with our object as expected.
print(center.x)
center.x = center.x + 2
print(center.x)
3
5
type(center)
__main__.Point
dir(center)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'x',
'y']
The reason for __main__ in the object type will be discussed later on. From the member list we see that there are
many automatically created members. Some of them will be discussed later on, too.
7 Variables
Variables in Python follow a different, much more elegant approach than in other programming languages. In this
chapter we discuss the details of variables in Python and some consequences of Python’s approach to variables.
• Names and Objects (page 75)
• Types (page 80)
• Operators (page 85)
• Efficiency (page 89)
Related exercises:
• Variables and Operators (page 253)
• Memory Management (page 256)
Related project: Vector Multiplication (page 301).
In contrast to most other programming languages Python follows a very clean and simple approach to memory access
and management. We now dive deeper into Python’s internal workings to come up with a new understanding of what
we called variables in the Crash Course (page 43).
Most programming languages, C for instance, assign fixed names to memory locations. Such combinations of memory
location and name are known as variables. Assigning a value to a variable then means that the compiler or interpreter
writes the value to the memory location to which the variable’s name belongs. There is a one-to-one correspondence
between variable names and memory locations.
Consider the following C code:
int a;
int b;
a = 5;
b = a;
The first two lines tell the C compiler to reserve memory for two integer variables. The third line writes the value 5
to the location named a. The fourth line reads the value at the location named a and writes (copies) it to the location
named b.
Data Science and Artificial Intelligence for Undergraduates
Fig. 7.1: Memory is organized as a linear sequence of bytes. Used and currently unused bytes are managed by the
operating system and by the compiler. In C programs there is a one-to-one correspondence between variable names and
memory locations.
Python allows for multiple names per memory location and adds a layer of abstraction.
In Python everything is an object and objects are stored somewhere in memory. If we use integers in Python, then
the integer value is not written directly to memory. Instead, additional information is added and the resulting more
complex data structure is written to memory.
A newly created Python object does not have a name. Instead, Python internally assigns a unique number to each
object, the object identifier or object ID for short. Thus, there is a one-to-one correspondence between object IDs and
memory locations.
In addition to a list of all object IDs (and corresponding memory locations), Python maintains a list of names occurring
in the source code. Each name refers to exactly one object. But different names may refer to the same object. In this
sense Python does not have variables as described above, but only objects and names tied to objects.
Consider the following code:
a = 5
b = a
The first line creates an integer object containing the value 5 and then ties the name a to this object. The second line
takes the object referenced by the name a and ties a second name b to it.
Fig. 7.2: In Python one memory location may have several names, but a unique object ID.
Important: The assignment operation = in Python is not about writing something to memory. Instead, Python takes
the existing object on the right-hand side of = and ties an additional name to it.
The object on the right-hand side may have existed before or it may be created by some operation specified by the
code following =.
It’s also possible to create nameless objects. Simply omit name = before some object creation code.
print(id(a))
print(id(b))
139672564187504
139672564187504
In Python we have objects and we have values contained in the objects. Thus, there are two fundamentally different
questions which might be relevant for controlling program flow:
• Do two names refer to the same object?
• Do two objects (referred to by two names) contain the same value?
Consider the following code:
a = 1.23
b = 1.23
It creates two float objects both holding the value 1.23. To see that there are two objects we can look at the object
IDs:
print(id(a))
print(id(b))
139672516871888
139672519436656
So the answer to the first question is ‘no’, but the answer to the second question is ‘yes’.
To compare identity of objects Python provides the is operator. To compare equality of values Python has the ==
operator. Both yield a boolean value as result.
print(a is b)
print(a == b)
False
True
Negations of both operators are is not and !=, respectively. Using is is equivalent to comparing object IDs:
print(id(a) == id(b))
False
Hint: The behavior of the is operator is hardwired in Python (it amounts to using == on the integer objects returned
by id). But == simply calls the dunder method __eq__ of the left-hand side object. Thus, what happens during
comparison depends on an object’s type. When writing your own classes (object types) you may implement the
__eq__ method whenever appropriate. Without a custom implementation Python uses a default one behaving
similarly to is.
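As a sketch, a custom __eq__ could make == compare the coordinates of point objects (the Point class and its members are made up for illustration):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # called whenever == is used on a Point object
        return self.x == other.x and self.y == other.y

a = Point(1, 2)
b = Point(1, 2)
print(a is b)   # False: two distinct objects
print(a == b)   # True: our __eq__ compares values
```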
Names in Python have a scope, that is, a region of code where they are valid. Names defined outside functions and
other structures are referred to as global names or global variables or simply globals. If a name is defined (that is, tied
to some object) inside a function or some other structure, then the name is local. Local names are undefined outside
the function or structure they are defined in.
def my_func():
    print(c)
    d = 456
c = 123
my_func()
print(d)
123
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [7], in <cell line: 7>()
5 c = 123
6 my_func()
----> 7 print(d)
If there is a local name which is also a global name, then its local version is used and the global one is left untouched.
def my_func():
    c = 456
    print(c)
c = 123
my_func()
print(c)
456
123
But how to change a global variable from inside a function? The global keyword tells the interpreter that a name
appearing in a function refers to a global variable. The interpreter then uses the global variable instead of creating a
new local variable.
def my_func():
    global c
    c = 456
    print(c)
c = 123
my_func()
print(c)
456
456
We cannot access a global variable from inside a function and then introduce a local variable with the same name.
This leads to an error because each name appearing in an assignment in a function is considered local throughout
the function. Consequently, accessing the value of a global variable before creating a corresponding local variable is
interpreted as accessing an undefined name. The interpreter then complains about accessing a local variable before
assignment.
def my_func():
    print(c)
    c = 456
c = 123
my_func()
print(c)
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [10], in <cell line: 6>()
3 c = 456
5 c = 123
----> 6 my_func()
7 print(c)
Important: It’s considered bad practice to use lots of global variables. Global variables result in low readability of
code. Exceptions prove the rule.
7.2 Types
Here we introduce two more very fundamental object types, discuss type conversion and introduce the Python-specific
concept of immutability.
In the Crash Course (page 43) we met several data or object types (classes):
• integers
• floats
• booleans
• lists
We also met strings, which will be discussed in more detail later on. Python ships with some more data types. Next
to complex numbers, which we do not consider here, and several list-like types Python knows two very special data
types:
• None type
• NotImplemented type
Both types can represent only one value: None and NotImplemented, respectively. Thus, they can be considered
as constants. But since in Python everything is an object, constants are objects, too. Objects have a type (class). Thus,
there is a None type and a NotImplemented type.
print(type(None))
print(type(NotImplemented))
<class 'NoneType'>
<class 'NotImplementedType'>
Existence of NotImplemented will be justified soon. Typically it’s used as return value of functions to signal the
caller that some expected functionality is not available.
The value None is used whenever we want to express that a name is not tied to a useful object. In that case we simply tie
the name to the None object. We write ‘the’ because the Python interpreter creates only one object of None type.
Such an object always holds one and the same value, so there is no reason to create several different None objects.
a = None
b = None
print(id(a))
print(id(b))
94287211210560
94287211210560
a = 'Some string'
b = 'Some string'
print(id(a))
print(id(b))
139871271436464
139871271437360
In the second code block two string objects are created although both hold the same value. For both None values in
the first code block only one object is created. If you play around with this you may find that for short strings and
small integers Python behaves as for None. This issue will be discussed in detail soon.
Hint: None is a Python keyword like if or else or import. It is used to refer to the object of None type. But
the memory occupied by this object does not necessarily contain a string ‘None’ or something similar. In fact, this
object does not contain anything useful. Its value lies in its mere existence, which allows us to tie (temporarily unused)
names to it. The same holds for NotImplemented.
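One common way None shows up: a function without an explicit return statement returns None (a minimal sketch, the function name is made up):

```python
def do_nothing():
    pass

result = do_nothing()    # the function returns None implicitly
print(result is None)    # True
```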
We already met this concept when introducing boolean values. True and False are Python keywords, too. They
are used to refer to two different objects of type bool. But these objects do not contain a string ‘True’ or ‘False’
or something similar. Instead, a bool object stores an integer value: 1 for the True object and 0 for the False
object. How to represent None, True and so on in memory depends on the concrete implementation of the Python
interpreter and is not specified by the Python programming language.
Type casting means changing the data type. An integer could be cast to a floating point number, for example. Python
does not have a dedicated mechanism for type casting. Instead, dunder methods can be implemented to work with
objects of different types.
A very prominent dunder method for handling different object types is the __init__ method, which is called after
creating a new object. Its main purpose is to fill the new object with data. For Python standard types like int,
float, bool the __init__ method accepts several different data types as argument.
We’ve already applied the function int, which creates an int object, to strings. Thus, we have seen that the
__init__ method of int objects accepts strings as argument and tries to convert them to an integer value. The
other way round, str for creating string objects accepts integer arguments.
a = '123' # a string
b = int(a)
print(type(b))
print(b)
<class 'int'>
123
a = 123 # an integer
b = str(a)
print(type(b))
print(b)
<class 'str'>
123
a = 2 # an integer
b = float(a)
print(type(b))
print(b)
<class 'float'>
2.0
Data may get lost due to type casting. The Python interpreter will not complain about possible data loss.
a = 2.34 # a float
b = int(a)
print(type(b))
print(b)
<class 'int'>
2
Hint: It’s good coding style to use explicit type casting instead of relying on implicit conversions whenever this
increases readability.
A counter example is 1.23 * 56, where the integer 56 is converted to float implicitly. Explicit casting would
decrease readability: 1.23 * float(56).
Note: If you define a custom object type, it depends on your implementation of the type’s __init__ method what
data types can be cast to your type.
Casting to bool maps 0, empty strings and similar values to False, all other values to True.
print(bool(None))
print(bool(0))
print(bool(123))
print(bool(''))
print(bool('hello'))
False
False
True
False
True
If we use non-boolean values where booleans are expected, Python implicitly casts to bool:
if not '':
    print('cumbersome condition satisfied')
For historical reasons boolean values internally are integers (0 or 1). This sometimes yields unexpected (but
well-defined) results. An example is the comparison of integers to True.
a = 3

if a:
    print('first if')
else:
    print('first else')

if a == True:
    print('second if')
else:
    print('second else')
first if
second else
The first condition is equivalent to bool(3), which yields True, whereas the second is equivalent to 3 == 1,
yielding False. See PEP 28559 for some discussion of that behavior (PEP 285 introduced bool to Python).
7.2.4 Immutability
Objects in Python can be either mutable or immutable. Mutable objects allow modifying the value they hold.
Immutable objects do not allow changing their values. Objects of simple types like int, float, bool, str are
immutable whereas lists and most other types are mutable.
Understanding the concept of (im)mutability is fundamental for Python programming. Even if the source code
suggests that an immutable object gets modified, in fact a new object is created every time:
a = 1
print(id(a))
a = a + 1
print(id(a))
139871335317744
139871335317776
This code snippet first creates an integer object holding the value 1 and then ties the name a to it. In line 3, sloppily speaking, a is increased by one. More precisely, a new integer object is created, holding the result 2 of the
computation, and the name a is tied to this new object.
Mutable objects behave as expected:
a = [1, 2, 3]
print(id(a))
a[0] = 4
print(id(a))
139871271501888
139871271501888
Immutability of some data types allows the Python interpreter to operate more efficiently and to optimize code
during execution. We will discuss some of those efficiency-related features later on.
Always be aware of (im)mutability of your data. The following two code samples show fundamentally different
behavior:
a = 1 # immutable integer
b = a
a = a + 1
print(a, b)
59 https://peps.python.org/pep-0285/#resolved-issues
2 1
Increasing a does not touch b, because the integer object a and b refer to is immutable. Increasing a creates a new
object. Then a is tied to the new object and b still refers to the original one.
a = [1, 2, 3]  # mutable list
b = a
a[0] = 4
print(a, b)
[4, 2, 3] [4, 2, 3]
Modifying a also modifies b, because a and b refer to the same mutable object.
Although rarely needed, we mention the built-in function isinstance. It takes an object and a type as parameters
and returns True if the object is of the given type.
print(isinstance(8, int))
print(isinstance(8, str))
print(isinstance(8.0, float))
True
False
True
There are a bunch of dunder functions one should implement when creating custom types (cf. Custom Object Types
(page 73)):
• __str__ is called by the Python interpreter to get a text representation of an object. For instance, it’s called
by print and whenever one tries to convert an object to string via str(...).
• __repr__ is similar to __str__ but should return a more informative string representation. In the best case,
it returns the Python code to recreate the object. See Python’s documentation⁶⁰ for details.
• __bool__ is called whenever an object has to be cast to bool.
• __len__ is called by the built-in function len to determine an object’s length. This is useful for list-like
objects.
60 https://docs.python.org/3/reference/datamodel.html#object.__repr__
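A minimal sketch tying these four methods together (the Playlist type is invented for illustration):

```python
class Playlist:
    def __init__(self, songs):
        self.songs = list(songs)

    def __str__(self):
        # used by print() and str()
        return 'playlist with {} songs'.format(len(self.songs))

    def __repr__(self):
        # in the best case: Python code that recreates the object
        return 'Playlist({!r})'.format(self.songs)

    def __bool__(self):
        # empty playlists behave like False
        return len(self.songs) > 0

    def __len__(self):
        # used by the built-in len()
        return len(self.songs)

p = Playlist(['song A', 'song B'])
print(p)                   # playlist with 2 songs
print(repr(p))             # Playlist(['song A', 'song B'])
print(bool(Playlist([])))  # False
print(len(p))              # 2
```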
Since everything in Python is an object, types are objects, too. Thus, types may provide member variables and
methods in addition to the corresponding objects’ member variables and methods. In some programming languages
members of a type are called static members.
Member variables of types occur for instance if constants have to be defined (almost always for convenience):
class ColorPair:
    red = (1, 0, 0)
    green = (0, 1, 0)
    blue = (0, 0, 1)
    yellow = (1, 1, 0)
    cyan = (0, 1, 1)
    magenta = (1, 0, 1)
Member functions of types are rarely used. One use case is very flexible constructors for complex types, which do
not fit into the __init__ method due to many different variants of possible arguments. Often such constructors
are named from_... and corresponding object creation code looks like
obj = SomeType.from_other_type(other_object)
In such cases the from_... methods return a new object of the corresponding type, that is, they implicitly call the
__init__ method.
Defining methods for types requires advanced syntax constructs we do not discuss here.
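For the curious, the usual construct for such alternative constructors is Python's @classmethod decorator; the Point type here is a made-up example, not taken from the book:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    @classmethod
    def from_tuple(cls, xy):
        # alternative constructor: builds the object via cls(...),
        # which implicitly calls __init__
        return cls(xy[0], xy[1])

p = Point.from_tuple((3, 4))
print(p.x, p.y)  # 3 4
```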
7.3 Operators
Like most other programming languages Python offers lots of operators to form new data from existing data. Important
classes of operators are
• arithmetic operators (+, -, *, /,…),
• comparison operators (==, !=, <, >,…),
• logical operators (and, or, not,…).
Expressions containing more than one operator are evaluated in a well-defined order⁶¹. Python’s operators can be listed
from highest to lowest priority. Operators with identical priority are evaluated from left to right.
Syntax Operator
** exponentiation
+, - (unary) sign
*, /, //, % multiplication, division, floor division, modulo
+, - (binary) addition, subtraction
==, !=, <, >, <=, >= comparison
not logical not
and logical and
or logical or
We may write chained comparisons like a < b < c. Python interprets them as single comparisons connected by
and, that is, a < b and b < c.
Note: Unfortunate expressions like a < b > c are allowed, too. This example is equivalent to a < b and b > c. In a chain only neighboring operands are compared to each other! There is no comparison between a and c
here.
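A short demonstration of this pitfall:

```python
a, b, c = 1, 5, 0

print(a < b > c)        # True: parsed as (a < b) and (b > c)
print(a < b and b > c)  # True: the equivalent explicit form
print(a < c)            # False: a and c are never compared in the chain
```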
For arithmetic binary operators there’s a shortcut for expressions like a = a + b. We may write them as a += b. The latter is called an augmented assignment. Although the result will look the same, there are two technical
differences one should know:
• augmented assignment may work in-place,
• for augmented assignment the assignment target will be evaluated only once.
In-place Computations
For usual binary operators the Python interpreter calls corresponding dunder methods, like __add__ for +. Aug-
mented assignments have their own dunder methods starting with i. So += calls __iadd__, for instance. The
intention is that += may work in-place, that is, without creating a new object. Of course, this is only possible for
mutable objects. If there is no dunder method for augmented assignment the interpreter falls back to the usual binary
operator’s dunder method.
Example: The + operator applied to two lists concatenates both lists. With a = a + b a new list object holding
the result is created and then the name a is tied to the new object. With a += b list b is appended to the existing
list object referred to by a.
a = [1, 2, 3]
b = [4, 5, 6]
print(id(a))
a = a + b
print(a)
print(id(a))
61 https://docs.python.org/3/reference/expressions.html#operator-precedence
139941989941376
[1, 2, 3, 4, 5, 6]
139941989942272
a = [1, 2, 3]
b = [4, 5, 6]
print(id(a))
a += b
print(a)
print(id(a))
139941989933632
[1, 2, 3, 4, 5, 6]
139941989933632
If the assignment target is a more complex expression like for list items, the expression will be evaluated twice with
usual binary operators, but only once if augmented assignment is used.
Example: In the following code an item of a list shall be incremented by 1. The item’s index is computed by some
complex function get_index (which for demonstration purposes is very simple here). The two code cells show
different implementations, resulting in a different number of calls to get_index.
def get_index():
    print('get_index called')
    return 2

a = [1, 2, 3, 4]
a[get_index()] = a[get_index()] + 1
print(a)
get_index called
get_index called
[1, 2, 4, 4]
def get_index():
    print('get_index called')
    return 2

a = [1, 2, 3, 4]
a[get_index()] += 1
print(a)
get_index called
[1, 2, 4, 4]
Note: If efficiency matters you should prefer augmented assignments. Even a += b with integers a and b is more
efficient than a = a + b, because the name a will be looked up only once in the table mapping names to object
IDs.
All Python operators, == and + for instance, simply call a specially named member function of the involved objects,
a so-called dunder method. Lines 2 and 3 of the following code cell do exactly the same thing:
a = 5
b = a + 2
c = a.__add__(2)
print(b)
print(c)
7
7
Dunder methods allow creating new object types which implement all the Python operators themselves. What
an operator does depends on the operands’ object type. For instance, + applied to numbers is usual addition, but +
applied to strings is concatenation.
For binary operators like + and == there is always the question which of both objects to use for calling the corresponding dunder method. In case of comparisons Python uses the dunder method of the left-hand side operand (up to
one minor exception which we don’t discuss here). For arithmetic operations Python always tries the left operand first
(again, we omit a minor exception). If it does not have the required dunder method, then Python tries the operand
on the right-hand side. If both objects lack the dunder method, then the interpreter stops with an error.
Binary arithmetic operations might be asymmetric. Thus, there are two variants of most arithmetic dunder methods:
one for applying an operation as the left-hand side operand and one for applying an operation as the right-hand side
operand. For addition the methods are called __add__ and __radd__, for multiplication we have __mul__ and
__rmul__. Others follow the same scheme.
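As a sketch with an invented Meters type, a class can support both operand positions like this:

```python
class Meters:
    """Made-up length type supporting addition with plain numbers."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # handles Meters(2) + 3
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented

    def __radd__(self, other):
        # handles 3 + Meters(2): called after int's __add__ gives up
        return self.__add__(other)

print((Meters(2) + 3).value)  # 5
print((3 + Meters(2)).value)  # 5
```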
Often binary operators shall be applied to objects of different types (adding integer and floating point values, for
instance). Even if both objects have the corresponding dunder method, one or both of them could lack code for
handling certain object types.
In such a case Python calls the dunder method and the method returns NotImplemented to signal that it doesn’t
know how to handle the other operand. Then the interpreter tries the dunder method of the other operand. If it
returns NotImplemented, too, then the interpreter stops with an error. The __add__ function of integer objects
cannot handle float objects, but __add__ of float objects can handle integers, for example:
a = 2
b = 1.23
print(a.__add__(b))
print(b.__add__(a))
NotImplemented
3.23
Writing a + b in the example above first calls a.__add__(b), which returns NotImplemented, then
b.__radd__(a). With b + a the first call goes to b.__add__(a) and no second call (to a.__radd__(b)) is required.
An example of beneficial use of Python’s flexible mechanism for customizing operators is discussed in the project on
Vector Multiplication (page 301).
7.4 Efficiency
Python’s combination of the names/objects concept and (im)mutability tends to waste memory and CPU time:
• There might be many different objects all holding the same value. In principle, every time the number 1 occurs
in the source code, a new integer object is created.
• Tying names to other objects may leave objects without a name. Such objects are no longer accessible but remain
in memory.
• Modifying immutable objects requires creating new objects. Thus, even simple integer computations require
relatively complex memory management operations.
To mitigate these drawbacks, the Python interpreter uses several optimization strategies. Although such issues are
rather technical we briefly discuss them here, because they sometimes yield unexpected results.
To avoid object creation every time a new integer is used, the Python interpreter pre-creates integer objects for all
integers from -5 to 256. This saves CPU time. The somewhat cumbersome range stems from statistical considerations
about integer usage.
In addition, the interpreter takes care that no integer in this range is created twice during program execution. This
saves memory. The behavior is demonstrated in the following code snippet:
a = 8
b = 4 + 4
print(id(a))
print(id(b))
140115940688336
140115940688336
Both object IDs are identical, thus only one integer object is used. Since integer objects are immutable, this cannot
cause any trouble.
As for integers, the Python interpreter tries to avoid multiple string objects with the same value. Since the
comparisons involved may require too much CPU time, this technique, known as string interning, is only used for
short strings. The rules controlling which strings get interned and which don’t are relatively complex.
a = 'short'
b = 'sh' + 'ort'
print(id(a))
print(id(b))
140115859848048
140115859848048
prefix = 'this string is too long'
a = prefix + ' to be interned'
b = prefix + ' to be interned'
print(id(a))
print(id(b))

140115859848944
140115899208176
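Interning can also be requested explicitly with the standard library function sys.intern, which always returns the single shared object for a given string value:

```python
import sys

# without intern, strings containing spaces are usually not interned
b = sys.intern('a string with spaces, interned by hand')
c = sys.intern('a string with spaces, interned by hand')
print(b is c)  # True: both names refer to the one interned object
```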
Before executing a Python program, the interpreter checks the syntax and creates a list of all literals. Here, literals
are all types of explicit data appearing in the source code, like integers or strings. If some literal appears multiple
times and if objects of the corresponding data type are immutable, only one object is created.
# Copy the following Python code to a text file and feed the file to the
# Python interpreter to see the effect of optimization of repeated literals.
a = 'a rather long string literal'
b = 1000
c = 'a rather long string literal'
d = 1000
print(id(a))
print(id(b))
print(id(c))
print(id(d))
The names a and c will point to the same string object, although the string is too long to be interned by the string
interning mechanism. The names b and d will point to the same integer object, although they are outside the range
of preloaded integers.
Care has to be taken when using interactive Python interpreters like Jupyter. If the above code snippet is executed
line by line in an interactive interpreter, then four different objects will be created, because the interpreter does not
parse the full code in advance.
Important: Executing Python code with an interactive interpreter may yield different results than executing the
same code at once with a non-interactive interpreter! In particular, performance measures like memory consumption
may differ.
print(id(a))
print(id(b))
print(id(c))
print(id(d))
140115894986160
140115859673552
140115894986352
140115859679312
As described above, there might be objects without names. Such objects remain in memory, but are no longer accessible.
To avoid filling up memory as time passes, the Python interpreter automatically removes nameless objects from
memory. This mechanism is known as garbage collection and is a feature not available in all programming languages.
In the C programming language, for instance, the programmer has to take care to free memory if data isn’t needed
anymore.
Sometimes, especially when working with large data sets, one wants to get rid of some data in memory to have more
memory available for other purposes. One way is to tie all names referring to the no longer needed object to other
objects, which is somewhat unintuitive. Alternatively, the del keyword can be used to untie a name from an object.
a = 5000
del a
print(a)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [5], in <cell line: 3>()
1 a = 5000
2 del a
----> 3 print(a)
The last line leads to an error message, because after executing line 2 the name a is no longer valid. Note that del
only deletes the name, not the object. In the following code snippet the object remains in memory, because it has
another name:
a = 5000
b = a
del a
print(b)
5000
EIGHT
LISTS AND FRIENDS
We already met lists in the Crash Course (page 43). Now it’s time to discuss some details and to introduce further
list-like object types as well as Python features for efficiently working with lists and friends.
• Tuples (page 93)
• Lists (page 95)
• Dictionaries (page 98)
• Iterable Objects (page 98)
Related exercises: Lists and Friends (page 259).
8.1 Tuples
Tuples are immutable objects which hold a fixed-size list of other objects. Imagine tuples as a list of pointers to
objects, where the number of pointers and the pointers themselves cannot be changed. But note that the objects the
tuple entries point to can change if they are mutable. The object type for tuples is tuple.
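A short demonstration of this subtlety:

```python
inner = [1, 2]      # a mutable list
t = ('a', inner)    # the tuple itself is immutable...
inner.append(3)     # ...but a mutable item it points to can change
print(t)            # ('a', [1, 2, 3])
```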
8.1.1 Syntax
Tuples are defined as a comma separated list. Often the list is surrounded by parentheses:
colors = ('red', 'green', 'blue', 'yellow')
Tuples with only one item are allowed, too. To distinguish them from single objects, a trailing comma is required.
a = 1 # integer
b = (1) # integer
c = 1, # tuple containing one integer
d = (1,) # tuple containing one integer
8.1.2 Indexing
Tuple items can be accessed by index. The first item has index 0, the second index 1, and so on. The length of a tuple
is returned by the built-in function len. Indexing with negative numbers gives the items in reverse order.
print(colors[2])
print(colors[-1])
print(len(colors))
print(colors)
blue
yellow
4
('red', 'green', 'blue', 'yellow')
Subtuples can be extracted by so-called slicing. Simply provide a range of indices: [2:5] gives a new tuple consisting
of the items with indices 2, 3, 4. The new tuple is a new object and remains available even if the original tuple vanishes (cf. Garbage
Collection (page 91)).
some_colors = colors[1:3]
print(some_colors[0])
print(some_colors)
green
('green', 'blue')
Extracting every second item can be done by [3:10:2], which gives items with indices 3, 5, 7, 9. The general
syntax is [first_index:last_index_plus_one:step]. Here are some more indexing examples:
print(colors[2:4])
print(colors[2:])
print(colors[2:3])
print(colors[:3])
print(colors[:])
('blue', 'yellow')
('blue', 'yellow')
('blue',)
('red', 'green', 'blue')
('red', 'green', 'blue', 'yellow')
Tuples can be used for returning more than one value from a function. For this purpose Python provides the following
code construct:
a, b, c = 23, 42, 6
print(a, b, c)
23 42 6
That is, we can assign the contents of a tuple to a tuple of names. Such constructions typically are used if a function
returns several values packed into a tuple.
def quotient_and_remainder(a, b):
    c = a // b
    d = a % b
    return c, d

quotient, remainder = quotient_and_remainder(79, 10)
print(quotient)
print(remainder)
7
9
Calling the built-in function tuple creates a tuple from a list:
some_list = [1, 2, 3, 4, 5]
some_tuple = tuple(some_list)
Note that both list and tuple point to the same items. Items aren’t copied! Modifying (mutable) list items will modify
tuple items, too. See Multiple Names and Copies (page 96) for more details.
8.2 Lists
Lists are mutable objects which hold a flexible number of other objects. Lists can be regarded as mutable tuples.
Indexing syntax is the same, including slicing. The type name for lists is list. Calling list(...) creates a list from
several other types (tuples, for instance).
In principle, Python lists can hold different object types. But it is considered bad practice to use this feature. It’s
better to have lists made up of objects of the same type.
The append method adds an item at the end of a list:
a = [] # empty list
b = [2, 4, 6] # list with three integer items
b.append(8) # now the list has four items
print(len(b)) # length of list

4
The extend method appends all items of another list:
a = [1, 2, 3, 4]
a.extend([9, 8, 7])
print(a)

[1, 2, 3, 4, 9, 8, 7]
The sort method sorts a list in place:
a = [3, 2, 5, 4, 1]
a.sort()
print(a)

[1, 2, 3, 4, 5]
To search a list use the index method. It returns the index of the first occurrence of its argument in the list.
a = [3, 2, 5, 4, 1]
print(a.index(5))

2
To remove a list item either use the del keyword (remove by index) or the remove method (remove by value):
a = [1, 2, 3, 4]
del a[1]
print(a)
a.remove(3)
print(a)
[1, 3, 4]
[1, 4]
Multiple Names and Copies
a = [1, 2, 3, 4]
b = a
b[0] = 9
print(a)
print(b)
[9, 2, 3, 4]
[9, 2, 3, 4]
In this code snippet a list is created and the name a is tied to it. Then this list object gets b as a second name.
Important: We have one (!) list object with two names, not two lists! Thus, modifying b also modifies a.
To copy a list use the copy method:
a = [1, 2, 3, 4]
b = a.copy()
b[0] = 9
print(a)
print(b)
[1, 2, 3, 4]
[9, 2, 3, 4]
This creates a so-called shallow copy. A new list object is created and the items of the new list point to exactly the
same objects as the original list. If the original list consists of immutable objects (like integers), then modifying the
copy will not alter the original. But if list items point to mutable objects, then altering those objects through the copy will
modify the original list, too. Copying a list including all objects the list points to is known as deep copying. How to
automatically deep-copy a list will be discussed later on.
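The pitfall can be demonstrated with a list of lists:

```python
a = [[1, 2], [3, 4]]  # outer list holds mutable inner lists
b = a.copy()          # shallow copy: inner lists are shared
b[0].append(9)        # modifies the inner list both a and b point to
print(a)              # [[1, 2, 9], [3, 4]]
print(b)              # [[1, 2, 9], [3, 4]]
```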
Fig. 8.1: Python lists are lists of memory locations (object IDs). Shallow copying only copies the list of memory
locations. Deep copying also copies the data at those memory locations.
8.3 Dictionaries
Dictionaries are like lists, but indices (here denoted as keys) are not restricted to integers but can be of any immutable
type. Even tuples are allowed as keys if they do not contain mutable items. Data types for keys can be mixed. Type
name for dictionaries is dict.
person = {'name': 'John', 'age': 42}
person['age'] += 1
print(person['name'])
print(person['age'])

John
43

New keys can be added by simply assigning a value:
person['gender'] = 'male'
8.4 Iterable Objects
Note: Python follows the duck typing⁶² approach: If it looks like a duck, walks like a duck, swims like a duck, and
quacks like a duck, then it probably is a duck.
In other words: There are many object types in Python which behave like a list, but aren’t of type list. In particular,
len and indexing syntax [...] may be used for such objects.
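For instance, strings and range objects are not lists, yet they quack like lists:

```python
s = 'hello'    # a str, not a list
print(len(s))  # 5
print(s[1])    # e

r = range(3)   # a range object, not a list
print(len(r))  # 3
print(r[2])    # 2
```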
Tuples, lists, and dictionaries are examples of iterable objects. These are objects which allow for consecutive evaluation
of their items. There exist more types of iterable objects in Python and we may define new iterable objects by
implementing suitable dunder methods.
62 https://en.wikipedia.org/wiki/Duck_typing
8.4.1 For Loops
We briefly mentioned for loops in the Crash Course (page 43). Now we add the details.
Basic Iteration
For loops allow iterating through iterable objects of any kind, especially tuples, lists, and dictionaries.
colors = ('red', 'green', 'blue', 'yellow')
for color in colors:
    print(color)

red
green
blue
yellow
The indented code below the for keyword is executed as long as there are remaining items in the iterable object. In
the example above the name color first points to colors[0], then, in the second run, to colors[1], and so
on.
This index-free iteration is considered pythonic: it’s much more readable than indexing syntax and directly fulfills the
purpose of iteration, while indexing in most cases is useless additional effort.
Note: The range built-in function we saw in the crash course is not part of the for loop’s syntax. Instead, it’s a
built-in function returning an iterable object which contains a series of integers as specified in the arguments passed
to the range function. The returned object is of type range.
The range function takes up to three arguments: the starting index, the stopping index plus 1, and the step size.
Step size defaults to 1 if omitted. If only one argument is provided, it’s interpreted as stopping index (plus 1) and
start index is set to 0.
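Wrapping range results in list makes the generated integers visible:

```python
print(list(range(4)))        # [0, 1, 2, 3]: only the stop value given
print(list(range(2, 8)))     # [2, 3, 4, 5, 6, 7]: start and stop
print(list(range(2, 8, 2)))  # [2, 4, 6]: start, stop, and step
```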
If, in addition to the items themselves, their index is needed inside the loop, use the built-in function enumerate and
pass the iterable object to it. The enumerate function will return an iterable object which yields a 2-tuple in each
iteration. The first item of each tuple is the index, the second one is the corresponding object.
for index, color in enumerate(colors):
    print(str(index) + ': ' + color)
0: red
1: green
2: blue
3: yellow
Being new to Python one is tempted to use indexing for iteration over multiple lists in parallel:
names = ['John', 'Max', 'Lisa']
surnames = ['Doe', 'Muller', 'Lang']

# non-pythonic!
for i in range(len(names)):
    print(names[i] + ' ' + surnames[i])
John Doe
Max Muller
Lisa Lang
For iterating over multiple lists at the same time pass them to zip. This built-in function returns an iterable object
which yields tuples. The first tuple consists of the first elements of all lists, the second of the second elements of all
lists, and so on. The returned iterable object has as many items as the shortest list.
for name, surname in zip(names, surnames):
    print(name + ' ' + surname)
John Doe
Max Muller
Lisa Lang
8.4.2 Comprehensions
Applying some operation to each item of an iterable object can be done via for loops. But there are handy short-hands
known as list comprehensions and dictionary comprehensions.
List Comprehensions
General syntax is
new_list = [expression for item in iterable]
The following code snippet generates a list of squares from a list of numbers.
some_numbers = [2, 4, 6, 8]
squares = [n ** 2 for n in some_numbers]
print(squares)

[4, 16, 36, 64]
Two lists can be combined elementwise by zipping them inside the comprehension:
some_numbers = [2, 4, 6, 8]
more_numbers = [1, 2, 3, 4]
products = [a * b for a, b in zip(some_numbers, more_numbers)]
print(products)

[2, 8, 18, 32]
Same principles (zipping and nesting) work for more than two lists, too.
Dictionary Comprehensions
print(stars)
print(stars)
Note: Dictionary comprehensions may be used to create new dictionaries from arbitrary iterable objects. For instance
we could loop over the zipped lists, one holding the keys and one holding the values.
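A sketch of that idea (the names and values are made up):

```python
names = ['John', 'Max', 'Lisa']
ages = [42, 23, 31]

# dictionary comprehension over two zipped lists:
# keys from the first list, values from the second
people = {name: age for name, age in zip(names, ages)}
print(people)  # {'John': 42, 'Max': 23, 'Lisa': 31}
```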
Conditional Comprehensions
List and dictionary comprehensions can be extended by a condition: simply append if some_condition to the
comprehension.
some_numbers = [2, 4, 6, 8]
more_numbers = [1, 0, 2, 3]
quotients = [a / b for a, b in zip(some_numbers, more_numbers) if b != 0]
print(quotients)
The list (or dictionary) comprehension drops all items not satisfying the condition.
Python has a built-in function next which allows iterating through iterable objects step by step. For this purpose
we first have to create an iterator object from our iterable object. This is done by the built-in function iter. Then
the iterator object is passed to next. The iterator object keeps track of what the next item is.
a = iter([3, 2, 5])
print(next(a))
print(next(a))
print(next(a))
3
2
5
Creation of the intermediate iterator object is done automatically if iterable objects are used in for loops and comprehensions.
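When an iterator is exhausted, next raises a StopIteration exception; for loops catch this exception internally to detect the end of the iteration:

```python
it = iter([3, 2])
print(next(it))  # 3
print(next(it))  # 2
try:
    next(it)     # no items left
except StopIteration:
    print('iterator exhausted')
```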
The in operator tests whether an iterable object contains a given item:
a = [1, 5, 6, 8]
print(1 in a)
print(10 in a)

True
False
The in operator also works with strings, and not in negates the test:
print('ell' in 'hello')
print(5 not in a)
print(10 not in a)

True
False
True
Note: Conditions a not in b and not a in b are equivalent. The first uses the not in operator, while the
second uses in and then applies the logical operator not to the result.
NINE
STRINGS
We already learned basic usage of strings in the Crash Course (page 43). Here we add the details. Correct string
handling is essential for loading data from files and also for writing to files.
• Basics (page 105)
• Special Characters (page 107)
• String Formatting (page 109)
Related exercises: Strings (page 262).
9.1 Basics
9.1.1 Substrings
Strings are sequence-type objects, that is, they can be regarded as lists of characters. Each character then is itself a
string object.
a = 'some string'
print(a[0])
print(a[1])
print(len(a))
s
o
11
b = a[2:8]
print(b)
me str
Remember that strings are immutable. Thus, a[3] = 'x' does not replace the fourth character of a by x, but
leads to an error message.
Sometimes it’s necessary to use line breaks when specifying strings in source code. For this purpose Python knows
triple quotes.
a = '''
first line
second line
last line
'''
print(a)
As we see, line breaks in source code become part of the string: the output starts and ends with an empty line stemming
from the line breaks right after the opening and right before the closing quotes. If this behavior is not desired, end each
line with \.
a = '''\
first line
second line
last line\
'''
print(a)
first line
second line
last line
This way we get one-to-one correspondence between source code and screen output.
Note: A trailing \ tells Python that the source code line continues on the next line, that is, that the next line break
is not to be considered as a line break. Usage is not restricted to strings. Long source code lines (longer than 80
characters) should be wrapped to multiple short lines.
9.2 Special Characters
To have line breaks and special characters in string literals we have to use escape sequences like \n, \", and so on. If
we want to prevent the Python interpreter from translating escape sequences into corresponding characters, we may
use raw strings. A raw string is taken as is by the interpreter. To mark a string literal as raw string prepend r.
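A short comparison of a regular and a raw string literal:

```python
print('column1\tcolumn2')   # \t is translated to a tab character
print(r'column1\tcolumn2')  # raw string: backslash and t kept as is
```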
Strings may contain characters not available on some or all keyboards, like umlauts ä, ö, ü on US keyboards. To use
such special characters in string literals Python provides escape sequences \x, \u, and \U.
The relation between characters in strings and their numerical representation in memory will be considered in detail
in the chapter on Text Files (page 114). For the moment we content ourselves with the observation that there are
several such mappings, the most prominent ones known as ASCII⁶⁹, the ISO 8859⁷⁰ family and several Unicode⁷¹
variants.
ASCII knows 128 different characters, the ISO 8859 family some hundred, and Unicode several million. Nowadays,
Unicode in its UTF-8⁷² variant is the standard mapping for numerical representations of characters. Even
Windows is adopting UTF-8 more and more after backing the wrong horse for two decades.
There are lots of searchable lists of Unicode characters on the web. Wikipedia’s List of Unicode characters⁷³ is a
good starting point.
63 https://docs.python.org/3/library/stdtypes.html#str.count
64 https://docs.python.org/3/library/stdtypes.html#str.find
65 https://docs.python.org/3/library/stdtypes.html#str.replace
66 https://docs.python.org/3/library/stdtypes.html#str.split
67 https://docs.python.org/3/library/stdtypes.html#str.upper
68 https://docs.python.org/3/library/stdtypes.html#str.isalnum
69 https://en.wikipedia.org/wiki/ASCII
70 https://en.wikipedia.org/wiki/ISO/IEC_8859
71 https://en.wikipedia.org/wiki/Unicode
72 https://en.wikipedia.org/wiki/UTF-8
73 https://en.wikipedia.org/wiki/List_of_Unicode_characters
To use Unicode characters in a string literal we either may type or copy them to the source code file or we may specify
the character numerically. If the numerical representation has two hexadecimal digits, use \x followed by the two
digits. In case of 4 digits use \u and for 8 digit characters use \U.
print('umlauts: \xe4, \xf6, \xfc')
print('Greek: \u03b1, \u03b2, \u03b3')
print('Chinese (?): \u4e2d, \u56fd')

umlauts: ä, ö, ü
Greek: α, β, γ
Chinese (?): 中, 国
Some Unicode codes do not represent concrete characters, but control reading direction (left to right or right to left)
and other properties like spacing.
Fig. 9.1: Collaborative editing can quickly become a textual rap battle fought with increasingly convoluted invocations
of U+202a to U+202e. Source: Randall Munroe, xkcd.com/1137⁷⁴
74 https://xkcd.com/1137
9.3 String Formatting
String objects have a format method. This method allows for converting numbers and other data to strings.
a = 4
b = 5
c = a + b
nice_string = 'The result of {} plus {} is {}.'.format(a, b, c)
print(nice_string)

The result of 4 plus 5 is 9.
Calling the format method of a string replaces all pairs {} by the arguments passed to the format method. Next to
integers also floats and strings can be passed to format. The original string is not modified because it’s immutable.
Instead, format returns a new string object.
The arguments of format can also be accessed by providing their index: the first argument has index 0, the second
has index 1, and so on.
a = 4
b = a + a
print('The result of {0} plus {0} is {1}.'.format(a, b))

The result of 4 plus 4 is 8.
For more complex output (or more readable code) keyword arguments can be passed to format.
Converting numbers to strings sometimes requires additional parameters: How many digits to use for floats? Shall
numbers in several lines be aligned horizontally? There is a whole ‘mini language’ for writing formatting options.
Here we provide only some examples. For details see Python documentation on format string syntax75 .
75 https://docs.python.org/3/library/string.html#formatstrings
If indices or names are used for referring to format’s arguments, they have to be placed on the left-hand side of the
colon: {name:3.2f}.
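A few common format specifications (the values are arbitrary):

```python
print('{:10.3f}'.format(3.14159))         # width 10, 3 decimal places
print('{:>8}'.format('abc'))              # right-aligned in width 8
print('{:<8}|'.format('abc'))             # left-aligned, | shows padding
print('{:05d}'.format(42))                # integer, zero-padded to 5 digits
print('{x:.1f}'.format(x=2.71828))        # keyword argument with precision
```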
Starting with version 3.6 of Python there is a more comfortable way to format strings, known as formatted string
literals or f-strings⁷⁶. It’s very similar to formatting via the format method, but requires less code and increases
readability. The two major differences are:
• string literals have to be prefixed by f,
• the curly braces may contain a Python expression (an object name, for instance).
a = 123
print(f'Here you see {a}.')
b = 4.56789
print(f'Formatting works as above: {b:.2f} is a rounded float.')

Here you see 123.
Formatting works as above: 4.57 is a rounded float.
76 https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals
TEN
ACCESSING DATA
There exist several sources of data: files, databases, and web services, for instance. Here we consider basic file access
and common file formats for storing large data sets. We also learn how to automatically download data from the web
and how to scrape data from websites.
• File IO (page 111)
• Text Files (page 114)
• ZIP Files (page 117)
• CSV Files (page 118)
• HTML Files (page 119)
• XML Files (page 121)
• Web Access (page 122)
Related exercises: File Access (page 264).
Related projects:
• Cafeteria (page 315)
• Weather (page 305)
– DWD Open Data Portal (page 305)
– Getting Forecasts (page 307)
10.1 File IO
Next to screen IO, input from and output to files is the most basic operation related to data processing. Almost all
data science projects start with reading data from one or more files. In this chapter we discuss basic file access. Later
on there will be several specialized modules and functions to make things more straightforward. But from time to
time, in case of uncommon file formats, one has to resort to the most basic operations.
10.1.1 Basics
Reading data from a file or writing data to a file requires three steps:
1. Open the file
Tell the operating system that file access is required. The operating system checks permissions and, if everything is okay, returns a file identifier (usually a number), which has to be used for all subsequent file operations.
2. Read or write data
Tell the operating system to move data between the file and some place in memory which can be accessed by the
Python interpreter.
3. Close the file
Tell the operating system that file access is finished, so that the file and related resources can be released.
Data Science and Artificial Intelligence for Undergraduates
f = open('testdir/testfile.txt', 'r')
file_content = f.read()
f.close()
print(file_content)
Some text
in some file
splitted over
multiple lines.
This code snippet opens a file for reading (argument 'r') and assigns the name f to the resulting file object. Then the whole content of the file is stored in the string object file_content. Finally, the file is closed and its content is printed to the screen.
If something goes wrong, for instance the file does not exist, the Python interpreter stops execution with an error
message. For the moment, we do not do any error checking when operating with files (this is very bad practice!).
Note: The read method and all other methods for reading and writing files can be used to process text data and binary data. Providing the 'r' argument to open tells Python to open the file as a text file. Reading data from the file then results in a string object. If 'rb' is used instead, then the file is handled as a binary file and reading results in a bytes object. Details will be discussed in the chapter on Text Files (page 114).
The default mode is 'r'. So specifying no mode opens the file for reading in text mode.
Important methods for reading and writing files are read, readline, readlines, write, writelines,
seek. See methods of file objects77 in the Python documentation.
For more details on access modes see the documentation of open78 .
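A small sketch of writelines and readlines (the file name and contents are made up; a temporary directory keeps the example self-contained):

```python
import os
import tempfile

# create a path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# write two lines to the file
f = open(path, 'w')
f.writelines(['first line\n', 'second line\n'])
f.close()

# read all lines back into a list of strings
f = open(path, 'r')
lines = f.readlines()
f.close()

print(lines)
```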
Paths to a file are operating system dependent. Thus, using paths in the open function makes our code operating
system dependent. This should be avoided and luckily there are techniques to avoid such OS dependence.
Linux/Unix/macOS
In Linux and other Unix-like systems (macOS, for instance), all files can be accessed via paths of the form '/directory/subdirs/file'. That is, a list of directory names separated by slashes and ending with the file name. If the path starts with a slash, it's an absolute path; otherwise it's a relative one.
Drives can be mounted as directories anywhere in the file system's hierarchy. Thus, there is no need for special drive-related path components.
Windows
Windows uses a different format: 'drive:\directory\subdirs\file'. Backslashes instead of slashes are used as delimiters, and absolute paths carry an additional drive letter followed by a colon. The purpose of the drive letter is to select one of several physical (or even logical) drives.
From the programmer’s point of view, additional effort is required to make code work in both worlds.
77 https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
78 https://docs.python.org/3/library/functions.html#open
The Python module os.path provides the function join. This function takes directory names and a file name as arguments and returns a string containing the corresponding path with appropriate (OS-dependent) delimiters. So the output of the following code snippet depends on the OS used for execution.

import os.path

# reconstructed: join directory and file name in an OS-independent way
test_path = os.path.join('testdir', 'testfile.txt')
print(test_path)

testdir/testfile.txt
import os

# reconstructed: show the OS-dependent path separator
print('path separator:', os.sep)

path separator: /
Often data sets are scattered over many files, for instance one file per customer, each file containing all the customers
transactions in an online shop. In such cases we need to get a list of all files in a specified directory. Such functionality
is provided by Python’s glob module.
import glob

file_list = glob.glob('testdir/*')
for file_name in file_list:
    print(file_name)
testdir/test.zip
testdir/testwrite-windows.txt
testdir/iso8859-1.txt
testdir/umlauts.txt
testdir/testfile.txt
testdir/testwrite.txt
testdir/utf-8.txt
The glob module’s glob function takes a path containing wildcards like * (arbitrary string) and ? (arbitrary char-
acter), for instance, and returns a list of all files matching the specified path.
10.2 Text Files

Text files are the most basic type of files. They contain string data. Historically there was a one-to-one mapping between byte values (0…255) and characters. Nowadays things are much more complex, because representing all the world's languages requires more than 256 different characters. When reading from and writing to text files, the mapping between characters and their numerical representation in memory or on storage devices is of utmost importance.
Text files not only contain so-called printable characters like letters and numbers, but also control characters like line breaks and tab stops. Related issues will be discussed in this chapter, too.
10.2.1 Encodings
Every kind of data has to be converted to a stream of bits. Else it cannot be processed by a computer. For strings we
have to distinguish between their representation on screen (which symbol) and their representation in memory (which
sequence of bits). Mapping between screen and memory representation is known as encoding. Decoding is mapping
in opposite direction.
Fig. 10.1: Fortunately, the charging one has been solved now that we’ve all standardized on mini-USB. Or is it micro-
USB? Shit. Source: Randall Munroe, xkcd.com/92779
ASCII
Historically, each character of a string has been encoded as exactly one byte. A byte can hold values from 0 to
255. Thus, only 256 different characters are available, including so called control characters like tabs and new line
characters.
The mapping between byte values and characters, the so-called character encoding, has to be standardized to allow exchanging text files. For a long time, the most widespread standard has been ASCII (American Standard Code for Information Interchange). But since ASCII does not contain special characters like umlauts from other languages, several other encodings were developed. The ISO 885980 family is a very prominent set of ASCII derivatives.
79 https://xkcd.com/927
80 https://en.wikipedia.org/wiki/ISO/IEC_8859
The first 128 characters of almost all encodings coincide with ASCII, but the remaining 128 contain different symbols.
Thus, to read text files one has to know the encoding used for saving the file. Typically, the encoding is not (!) saved
in the file, but has to be guessed or communicated along with the file. Have a look at the list of encodings81 Python
can process.
Unicode
Nowadays, Unicode is the standard encoding. More precisely, Unicode defines a group of encodings. We do not go
into the details here. For our purposes it suffices to know that Unicode contains several hundred thousand symbols
and the most important encoding of Unicode is called UTF-8. The eight means that most characters require only 8
bits. The symbols associated with the byte values 0 to 127 coincide with ASCII. A byte value above 127 indicates a
multi-byte symbol comprising two, three, or four bytes.
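This can be observed directly in Python; the string 'Höhe' here is just an illustrative choice:

```python
s = 'Höhe'
encoded = s.encode()  # UTF-8 is the default encoding

print(len(s))        # number of characters
print(len(encoded))  # number of bytes; 'ö' occupies two bytes in UTF-8
print(encoded)
```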
Linux/Unix/macOS
Non-Windows systems (Linux, Unix, macOS) have had native UTF-8 support for decades. It's the standard encoding for websites and other internet-related applications.
Windows
Windows, even Windows 10, uses a different Unicode encoding under the hood and supports UTF-8 only at the surface. Sometimes, if one has to dig deeper into the system, unexpected things may happen. Older Windows versions did not have UTF-8 support at all. Always check the encoding if you work with text data generated on a Windows system!
Encodings in Python
Python uses UTF-8 and strictly distinguishes between strings and their encoded representation. The string is what we see on screen, whereas the encoded form is what is written to memory and storage devices.
String objects provide the encode member function. This function returns a sequence of bytes. This sequence is of
type bytes. A bytes object is immutable. In essence, it’s a tuple of integers between 0 and 255.
The other way round bytes objects provide a member function decode to transform them to strings.
As we see, bytes objects can be specified like strings, but prefixed by b. The only difference is that all bytes
holding values above 127 or non-printable characters (line breaks, for instance) are replaced by their integer values
in hexadecimal notation with the prefix \x, which is the escape sequence for specifying characters in hexadecimal
notation. If we want to use octal notation, the escape sequence is \000 where 000 is to be replaced by a three digit
octal number.
b = 'Höhe'.encode()  # example value; the original snippet defining b is not shown
print(b)

b'H\xc3\xb6he'

c = b.decode()
print(c)

Höhe
Note: The encode and decode methods accept an optional encoding parameter, which defaults to 'utf-8'.
There is also a mutable version of bytes objects: bytearray objects. They provide a decode function, too.
81 https://docs.python.org/3/library/codecs.html#standard-encodings
Reading from a file opened in text mode is equivalent to reading after opening in binary mode followed by a call to decode. Similarly for writing. The open function accepts an optional encoding parameter for text mode. Its default is platform dependent (the locale's preferred encoding, which on most Linux systems is UTF-8).
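A sketch of the encoding parameter in action (file name and content are made up; a temporary directory keeps it self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'latin1.txt')

# write text containing a non-ASCII character with ISO 8859-1 encoding
f = open(path, 'w', encoding='iso-8859-1')
f.write('Müller')
f.close()

# in ISO 8859-1 every character is one byte, so the file holds 6 bytes
print(os.path.getsize(path))

# reading requires the matching encoding
f = open(path, 'r', encoding='iso-8859-1')
content = f.read()
f.close()
print(content)
```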
Encoding line breaks in text files is done differently on different operating systems. The ASCII and Unicode standards
define two symbols indicating a line break. One is symbol 10, known as line feed (LF for short). The other is symbol
13, known as carriage return (CR for short).
Historically, when typewriters were the standard text processing tools, starting a new line required two actions: move
to next line without moving the carriage, then move the carriage to its rightmost position. Thus, there are two different
symbols for these two actions.
Linux/Unix/macOS
Linux and other Unix like system (macOS, for instance) use single byte line breaks encoded by LF. Old versions of
macOS used CR, but then developers switched to LF.
Windows
Windows adheres to the two-step legacy from the pre-computer era. That is, on Windows line breaks in text data are encoded by the two bytes CR and LF.
Python can handle all three versions of line break codes (LF, CR, CR LF) and tries to hide the differences from the programmer. But be aware that writing text files may produce different results on Windows and Linux/Unix/macOS machines.
Wrong Encoding

If we open an ISO 8859-1 encoded text file without specifying an encoding (that is, UTF-8 is used), the interpreter either fails to interpret some bytes or shows wrong symbols.

import os.path

f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
text = f.read()
f.close()

print(text)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
----> 2 text = f.read()
3 f.close()
5 print(text)
If we open a UTF-8 encoded file with ISO 8859-1 decoding, we see garbled symbols.

# reconstructed snippet; the file name is taken from the directory listing above
f = open(os.path.join('testdir', 'utf-8.txt'), 'r', encoding='iso-8859-1')
text = f.read()
f.close()

print(text)
On Linux and Co. the file will have 18 bytes. On Windows it will have 28 bytes due to Windows’ 2-byte line breaks.
Opening the file in binary mode shows the line break encoding:

# reconstructed snippet; the file name is assumed from the directory listing above
f = open(os.path.join('testdir', 'testwrite.txt'), 'rb')
text = f.read()
f.close()

print(tuple(text))
(116, 101, 115, 116, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 116, 101, 115, 116)
Using print(text) directly shows line breaks as \n, which is nice almost always, but not here. So we convert
the bytes object to a tuple of integers before printing.
If the file has been written on a Windows machine, it looks like this:
(116, 101, 115, 116, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13,
↪ 10, 13, 10, 13, 10, 116, 101, 115, 116)
10.3 ZIP Files

Large data sets usually ship as compressed files, mostly ZIP files. Extracting such files requires lots of disk space. Compressed text files, for instance, are much smaller than the original files (a factor of 5 to 10).
In Python we can use the zipfile module. This module allows reading single files from a ZIP archive without extracting the whole archive. We have to create an object of type zipfile.ZipFile. Such objects provide an open method. The return value of open is a file-like object, that is, it can be processed like usual files. Files from ZIP archives are always opened in binary mode by the ZipFile object's open method.
The namelist method returns a list of file names in the ZIP archive.
import os.path
import zipfile

# reconstructed snippet: open the archive, list the contained files, read one of them
zf = zipfile.ZipFile(os.path.join('testdir', 'test.zip'))
print(zf.namelist())

f = zf.open('file.txt')
print('file contents:')
print(f.read().decode())
f.close()
zf.close()

['another_file.txt', 'file.txt']
file contents:
This is a file for testing zipfile module.
10.4 CSV Files

The simplest format for storing spreadsheet data is comma-separated values (CSV) in a text file. Each line of a CSV file contains one row of the spreadsheet. The columns are separated by commas or sometimes by another symbol. CSV files may contain column headers in their first line(s).
A typical CSV file looks like this:
first_name,last_name,town
John,Miller,Atown
Ann,Abor,Betown
Bob,Builder,Cetown
Nina,Morning,Detown
CSV files are not standardized. Thus, there might be cumbersome deviations from what one expects of a simple CSV file. The CSV format is used to move data between different programs which cannot read each other's native file formats.
In Python we can use the module csv for reading data from CSV files. It provides the class csv.reader. When
creating a csv.reader object we have to pass a file object of the CSV file as parameter. The csv.reader object
then is an iterator object. It yields one line of the CSV file per iteration. More precisely, it yields a list of strings.
Each string contains the data from the corresponding column.
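A minimal sketch of csv.reader; here the data comes from an in-memory string wrapped in io.StringIO instead of a real file object:

```python
import csv
import io

# same structure as the example CSV file above
data = 'first_name,last_name,town\nJohn,Miller,Atown\nAnn,Abor,Betown\n'

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)  # each row is a list of strings
```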
See documentation of csv module82 for details.
82 https://docs.python.org/3/library/csv.html
10.5 HTML Files

HTML (Hypertext Markup Language) files are text files containing additional information for rendering text, like font type, font size, and foreground and background colors. Also images, tables, and other objects may be described or referenced by an HTML document. Typically, HTML files are interpreted and rendered by web browsers. Almost all websites consist of HTML files.
In data science, knowing some basic HTML is important for web scraping, that is, for automatically extracting information from websites.
<html>
<head>
<title>Title of webpage</title>
</head>
<body>
<h1>Some heading</h1>
<p>Text and text and more text in a paragraph.
Here comes a <a href="http://some.where">link to somewhere</a>.</p>
</body>
</html>
The file starts with <html> and ends with </html>. Then there is a head and a body. The head contains auxiliary
information like the webpage’s title, which is often shown in the browser window’s title bar. The body contains the
contents of the page.
There are many different HTML tags to influence rendering of the contents.
Headings from large to small: h1, h2, h3, h4, h5.
Paragraph: p.
Link: a with attribute href.
Table: table, tr (row inside table), td (cell inside row), and some more.
Image: img with attribute src (the URL of the image).
Invisible elements for layout control: span (inline element), div (box).
All tags have the attributes style (for specifying font size, colors and so on), id (a unique identifier for advanced
style control and scripting), class (an identifier shared by several elements for advanced layout control).
Have a look at the HTML documentation83 for details.
Modern browsers have tools to help understand an HTML file's structure. In Firefox or Chromium right-click some
element of the webpage and click ‘Inspect’ in the pop-up menu. Then navigate through the HTML source. To see the
whole HTML source code right-click and choose ‘View Page Source’.
83 https://html.spec.whatwg.org/
There are several modules available for parsing HTML files in Python. Here, parsing means to convert the textual representation into more structured Python objects. One such module is Beautiful Soup84 , which is not part of Python's standard library and has to be installed separately.
The package name for installation is beautifulsoup4; for importing, bs4 is the correct name.
import bs4
We have to create a BeautifulSoup object, whose constructor takes a string or an opened file object as argument.
The BeautifulSoup object then provides methods to find HTML tags by specifying tag name, id attribute, class
attribute or one of several other properties. We do not have to write code for parsing HTML files. Instead we can
search the file with BeautifulSoup’s methods.
html = '''\
<html>
<head>
<title>Title of webpage</title>
</head>
<body>
<h1>Some heading</h1>
<p>Text and text and more text in a paragraph.
Here comes a <a href="http://some.where">link to somewhere</a>.</p>
</body>
</html>
'''
soup = bs4.BeautifulSoup(html)
The find_all method returns a list of objects representing subsets of the HTML file matching the arguments
passed to find_all. In the following code snippet we search for a tags, that is, for links. But we could also search
for certain attribute values and other criteria. There is also a find method which returns the first occurrence only.
The objects returned by find_all and find themselves provide corresponding methods to refine search.
links = soup.find_all('a')
print('#links:', len(links))
print('last link:', links[-1])
#links: 1
last link: <a href="http://some.where">link to somewhere</a>
10.6 XML Files

XML (Extensible Markup Language) files look like HTML Files (page 119), but with custom tag names. Each XML file may use its own set of tags to describe data. In principle, HTML is a special case of XML.
Standard-conforming XML files have some format specifications in their first lines. Content without such specifications could look like this:
<person>
<first_name>John</first_name>
<last_name>Miller</last_name>
<town>Atown</town>
</person>
<person>
<first_name>Ann</first_name>
<last_name>Abor</last_name>
<town>Betown</town>
</person>
<person>
<first_name>Bob</first_name>
<last_name>Builder</last_name>
<town>Cetown</town>
</person>
<person>
<first_name>Nina</first_name>
<last_name>Morning</last_name>
<town>Detown</town>
</person>
XML files can be parsed like HTML files with Beautiful Soup.
import bs4
xml = '''\
<person>
<first_name>John</first_name>
<last_name>Miller</last_name>
<town>Atown</town>
</person>
<person>
<first_name>Ann</first_name>
<last_name>Abor</last_name>
<town>Betown</town>
</person>
<person>
<first_name>Bob</first_name>
<last_name>Builder</last_name>
<town>Cetown</town>
</person>
<person>
<first_name>Nina</first_name>
<last_name>Morning</last_name>
<town>Detown</town>
</person>
'''
soup = bs4.BeautifulSoup(xml)

towns = soup.find_all('town')
print('#towns:', len(towns))
print('last town:', towns[-1])

#towns: 4
last town: <town>Detown</town>
10.7 Web Access

Today's primary source of data is the World Wide Web. In the simplest case we may download a data set as one single file. Many data providers instead offer an API (application programming interface) for accessing and downloading data. The worst case is if we have to scrape data from a website's HTML and other files.
Websites and other web services are hosted on servers somewhere in the world. If we type a URL (web address) into a browser's address bar, the browser connects to the corresponding server and asks it to send the desired file to the user's computer. This process is referred to as requesting a file or sending a request. Our computer is the client, asking the server for some service (sending a file). It's important to understand that we cannot simply collect a file from a remote server. We may only send a request to the server to send a file to us. The server may fulfill our request, send an error message, or not answer the request at all.
The technology behind this is much more involved than one might think: How to find the correct server? Which language
to speak with the server? What to do if the server does not answer the request? And so on. If you are interested in
some background details, use DNS86 and HTTP87 as entry points.
The Python interpreter may take the role of the browser and request files from servers.
To download a webpage or some other file from the web we may use the requests module. Note that requests is not part of the Python standard library; it is a third-party package, but the de facto standard for HTTP requests in Python.
The module provides a function get which takes a URL and returns a Response object. The Response object contains information about the server's answer to our request. If the request has been successful, the content member variable contains the requested file as a bytes object.
import requests
response = requests.get('https://www.fh-zwickau.de/~jef19jdw/index.html')
print(response.content.decode())
86 https://en.wikipedia.org/wiki/Domain_Name_System
87 https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Many webpages are to some extent dynamic. Their content can be influenced by passing parameters to them. Different
techniques exist for this purpose. Most common are so-called ‘GET’ and ‘POST’. We only consider the first method
here.
Passing arguments via ‘GET’ is very simple. We just add them to the URL. If the webpage processes arguments with
names arg1, arg2, arg3 and if we want to pass corresponding values value1, value2, value3, we may
request the URL
http://some.where/some_page.html?arg1=value1&arg2=value2&arg3=value3
The requests.get function accepts the keyword argument params, which increases readability. Instead of composing a long URL string we may write:
url = 'http://some.where/some_page.html'
params = {'arg1': 'value1',
'arg2': 'value2',
'arg3': 'value3'}
response = requests.get(url, params=params)
Most web services for data retrieval do not return HTML documents, but more machine-readable formats like CSV88, JSON89, or YAML90. There are Python modules for parsing all common formats.
Sometimes the data we want to analyze is scattered over a website and no direct connection to the underlying database is available. Thus, we have to find ways to extract data from websites automatically. This process is referred to as web scraping.
Legal considerations
There is no law which directly prohibits web scraping. But a website or part of it may be protected by copyright law.
Almost all large websites have terms of use, which have to be respected by the user. Some websites explicitly prohibit
automated data extraction. Some only prohibit commercial use of the provided data. Before starting a scraping project
read the terms of use!
When in doubt ask the website provider for written permission to scrape data from the site or ask a lawyer!
Another issue is the web traffic caused by scrapers. A scraping project might require several thousand requests to a server within a very short time. This may hurt the provider's infrastructure. A common attack for taking down a website is to send thousands of requests fast enough to prevent the server from answering requests from other users (DoS attack, denial-of-service attack). We don't want to be attackers. Thus, whenever you start a scraping project, tell your script to wait a few seconds between consecutive requests to a server!
88 https://en.wikipedia.org/wiki/Comma-separated_values
89 https://en.wikipedia.org/wiki/JSON
90 https://en.wikipedia.org/wiki/YAML
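The pacing advice above can be sketched as follows; fetch is a made-up stand-in for the real download call (requests.get, for instance), and the URLs are fictitious:

```python
import time

def fetch(url):
    # placeholder for a real request such as requests.get(url)
    return 'content of ' + url

urls = ['http://some.where/page1', 'http://some.where/page2']

pages = []
for url in urls:
    pages.append(fetch(url))
    time.sleep(0.1)  # in a real project wait a few seconds between requests

print(len(pages))
```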
Downloading a webpage via requests.get only yields the HTML document. Images and other media are usually not contained in HTML files. To download all images of a webpage we would have to find all img tags in the HTML file, then extract the URLs from the corresponding src attributes, and then download each URL separately.
Scraping data from websites often is tedious work and each scraping project requires different techniques for data
extraction. Knowing some little helpers may save the day.
Regular Expressions
There is a mini language for describing text search patterns. Such patterns are called regular expressions. They can be used, for instance, in conjunction with Beautiful Soup. We do not go into the details here.
Just an example (the original snippet is not shown here; this is a minimal reconstruction):

import re

# find all capitalized words in a string
result = re.findall(r'[A-Z][a-z]+', 'Python and Perl are programming languages.')
print(result)

['Python', 'Perl']
Most data contains time stamps. Python ships with the modules datetime and time for handling dates and times. The former provides tools for carrying out calculations with dates and times; the latter provides different time-related functionality.
datetime provides objects expressing a point in time (date, time, datetime) and objects expressing a duration
(timedelta).
import datetime

# reconstructed example: date arithmetic with a one-week timedelta
new_date = datetime.date(2020, 6, 30) + datetime.timedelta(days=7)
print(f'It\'s {new_date.day:02}.{new_date.month:02}.{new_date.year}.')

It's 07.07.2020.
import time
print('Have a break...')
time.sleep(5) # seconds
print('...now I\'m back.')
Have a break...
...now I'm back.
93 https://docs.python.org/3/library/time.html
ELEVEN
FUNCTIONS
We already met functions in the Crash Course (page 43). Here we repeat the basics and add lots of details important
for successful Python programming.
• Basics (page 127)
• Passing Arguments (page 128)
• Anonymous Functions (Lambdas) (page 132)
• Function and Method Objects (page 132)
• Recursion (page 133)
Related exercises: Functions (page 265).
11.1 Basics
A function has a name, which is used to call the function. A function can take arguments to control its behavior, and
a function may return a value, which then can be used by the calling code.
If a function shall return a value, then its code block has to contain the following line at least once (usually as the last line):
return some_value
The return keyword immediately stops execution of the function’s code and hands control back to the calling code,
which then can use the content of some_value.
If there is no return keyword in a function, then the function ends after executing its last line and returns None.
In this sense, Python functions always return a value.
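A short sketch of this behavior:

```python
def no_return():
    # no return statement anywhere in the body
    pass

result = no_return()
print(result)  # None
```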
After a call like a = some_function(), the name a refers to the function's return value. If the function does not return a value or the return value is not needed by the calling code, then the assignment a = ... can be omitted.
In Python, by convention, every function definition contains a triple-quoted documentation string. This string is ignored by the Python interpreter, but read by tools for automatic generation of source code documentation.
def my_function():
    '''Does nothing.'''
    pass
For some formatting conventions see the Python documentation94 . More details: PEP 25795 . There are many different conventions for docstring formatting; PEP 257 is only one of them.
In contrast to several other programming languages Python provides very flexible and readable syntax constructs for
passing data to functions. Here we’ll also discuss what happens in memory when passing data to functions.
Positional arguments have to be passed in exactly the same order as they appear in the function’s definition. There
can be as many positional arguments as needed. But a function may come without any positional arguments at all,
too.
Positional arguments may have a default value, which is used if the argument is missing in a function call. Syntax:

def some_function(arg1, arg2='some default value'):
    ...

If there are mandatory arguments (that is, arguments without default value) and optional arguments, then the latter have to follow the former.
94 https://docs.python.org/3/tutorial/controlflow.html#tut-docstrings
95 https://www.python.org/dev/peps/pep-0257/
If there are several optional arguments and only some shall be passed to the function, they can be provided as keyword
arguments. The order of keyword arguments does not matter when calling a function.
my_function(arg3=12345, arg2=54321)
Keyword arguments are very common in Python, since they increase readability when calling functions with many
arguments.
If we need a function which can take an arbitrary number of arguments, we may use the following syntax:

def some_function(arg1, arg2, *other_args):
    ...

Then other_args contains a tuple of all arguments passed to the function, but without arg1 and arg2.
If we need a function which can take an arbitrary number of keyword arguments, we may use the following syntax:

def some_function(kwarg1=0, kwarg2=0, **other_kwargs):
    ...

Then other_kwargs contains a dictionary of all keyword arguments passed to the function, but without kwarg1 and kwarg2.
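Both constructs can be sketched together (the function and argument names are made up):

```python
def show_args(arg1, *other_args, **other_kwargs):
    # collect extra positional arguments in a tuple
    # and extra keyword arguments in a dictionary
    return arg1, other_args, other_kwargs

print(show_args(1, 2, 3, x=4, y=5))  # (1, (2, 3), {'x': 4, 'y': 5})
```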
If we have a list or tuple and all items shall be passed as single arguments to a function, then we should use argument unpacking:

my_list = [4, 3, 1]
some_function(*my_list)

This call is equivalent to:

some_function(my_list[0], my_list[1], my_list[2])
Python never copies objects passed to a function. Instead, the argument names in a function definition are tied to the objects whose names are given in the function call.

def some_function(x, y):
    # reconstructed definition: print the ids of the arguments inside the function
    print(id(x), id(y))

a = 5
b = 'some string'
print(id(a), id(b))
some_function(a, b)
139792935403888 139792871198064
139792935403888 139792871198064
Here we have to take care: if we pass mutable objects to a function, then the function may modify these objects!
def clear_list(l):
for k in range(0, len(l)):
l[k] = 0
my_list = [2, 5, 3]
clear_list(my_list)
print(my_list)
[0, 0, 0]
Always look up a function's documentation if you have to pass mutable objects to a function. If the function modifies an object, this fact should be stated in the documentation. For instance, several functions of the OpenCV96 library, which we'll use later on, modify their arguments without properly documenting it.
A similar issue arises if we use mutable objects as default values for optional arguments. The name of an optional argument is tied to its default object only once, at the time of the function definition. If this object gets modified, then the default value changes for subsequent function calls.
def append42(l=[]):
    # reconstructed definition: the default list is created once and shared between calls
    l.append(42)
    print(l)

append42()
append42([1, 2, 3])
append42()
append42()

[42]
[1, 2, 3, 42]
[42, 42]
[42, 42, 42]
96 https://opencv.org
The standard way to avoid this pitfall is to create the default list inside the function (definition reconstructed):

def append42(l=None):
    if l is None:
        l = []    # fresh list for every call
    l.append(42)
    print(l)

append42()
append42([1, 2, 3])
append42()
append42()
[42]
[1, 2, 3, 42]
[42]
[42]
In a call to the function the first group of arguments has to be passed without keyword, the second group may be
passed with or without keyword, and the third group has to be passed by keyword.
The reason for the existence of this technique is quite involved and, presumably, we won't need this feature. But we should know it to understand code written by others.
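The grouping described above comes from the / and * markers in a function definition (positional-only parameters require Python 3.8 or newer; the function here is made up):

```python
def f(a, b, /, c, *, d):
    # a, b: positional-only; c: positional or keyword; d: keyword-only
    return (a, b, c, d)

print(f(1, 2, 3, d=4))     # (1, 2, 3, 4)
print(f(1, 2, c=3, d=4))   # (1, 2, 3, 4)
```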
The flexibility of argument passing makes it hard to clearly document which variants a library function accepts. Python's documentation uses a special syntax to state type and number of arguments as well as default values of a function.
Example: The glob module's glob function (see File IO (page 111)) is shown in Python's documentation97 as follows:

glob.glob(pathname, *, root_dir=None, dir_fd=None, recursive=False)

We see:
• pathname is the only positional argument.
• There are three arguments which have to be passed by keyword.
97 https://docs.python.org/3/library/glob.html#glob.glob
Square brackets in a documented signature mark optional arguments. The built-in input function98 , for instance, is documented as:

input([prompt])
Sometimes one has to call functions which take a function as argument. Passing a function as argument is very simple,
just give the function’s name as argument:
some_function(my_function)
Often, functions passed to other functions are needed only once in the code, and almost always they have a very simple structure. Providing a full function definition and wasting a name on such throwaway functions should thus be avoided. The tool for avoiding this overhead is the anonymous function, in Python known as a lambda. Here is an example:
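The example itself is not reproduced in this copy; a typical use is passing a throwaway key function to a sort (the list of pairs is made up):

```python
pairs = [(1, 'b'), (2, 'a'), (3, 'c')]

# sort by the second component of each pair
pairs.sort(key=lambda p: p[1])
print(pairs)  # [(2, 'a'), (1, 'b'), (3, 'c')]
```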
The lambda keyword creates a function in the same way as def, but without assigning a name to it. Keyword
arguments are allowed, too. In principle it is possible to define named functions with lambda:
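A sketch of a named lambda (the name square is made up); note that PEP 8 discourages this style in favor of def:

```python
square = lambda x: x * x

# equivalent definition with def, which is the preferred style
def square_def(x):
    return x * x

print(square(5), square_def(5))  # 25 25
```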
def my_function():
print('Oh, you called me!')
print(type(my_function))
print(id(my_function))
print(dir(my_function))
98 https://docs.python.org/3/library/functions.html#input
99 https://docs.python.org/3/library/functions.html#print
<class 'function'>
140438181160432
['__annotations__', '__builtins__', '__call__', '__class__', '__closure__', '__
↪code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__
↪str__', '__subclasshook__']
print(my_function.__name__)
my_function
class my_class:
def some_method(self, some_string):
print('You called me with {}'.format(some_string))
my_object = my_class()
print(type(my_object.some_method))
print(type(my_class.some_method))
<class 'method'>
<class 'function'>
If the method my_object.some_method is called, then the Python interpreter inserts the owning object as first
argument and calls the corresponding function my_class.some_method. In other words, the following two lines
are equivalent:
my_object.some_method('Hello')
my_class.some_method(my_object, 'Hello')
11.5 Recursion
A useful programming technique is recursion: a function calls itself until some stopping criterion is satisfied.
To illustrate this approach, consider a list. Each item either is an integer or again a list. If it is a list, then each item
of this list is an integer or another list, and so on. The task is to calculate the sum of all integers. This can’t be solved
by nested for loops, because we do not know the depth of the list nesting in advance.
def sum_list(l):
    ''' Sum up list items recursively. '''
    current_sum = 0
    for k in l:
        if type(k) == int:
            current_sum += k
        else:
            current_sum += sum_list(k)
    return current_sum

a = [1, 2, [3, 4, [5, 6], 7], 8]  # example nested list; any nesting works

print(sum_list(a))

36
TWELVE
We already met modules and packages in the Crash Course (page 43). Here we add the details and learn how to write
new modules and packages.
To get access to the functionality of a module it has to be imported into the source code file or into the interactive
interpreter session:
import module_name
This creates a Python object with name module_name (everything is an object in Python!) whose methods are
the functions defined in the module file. All names (functions, types, and so on) defined in the module can then be
accessed this way:
module_name.some_function()
Use the built-in function dir to get a list of all names defined in an imported module.
import datetime
print(dir(datetime))
A module’s name can appear very often in source code. The as keyword allows us to abbreviate module names:

import module_name as mod

mod.some_function()
It is even possible to avoid typing module names at all. If only a few functions of a module are needed, then they can
be imported directly:

from module_name import some_function, some_other_function

some_function()
some_other_function()
Data Science and Artificial Intelligence for Undergraduates
Directly imported names can be renamed with as, too:

from module_name import some_function as func

func()

With an asterisk all of a module’s names are imported at once:

from module_name import *

some_function()
But be careful; modules may contain hundreds of functions and importing all these functions may slow down your
code.
Note: The import statement first makes the Python interpreter look for a built-in module with the requested name
(that is, a module integrated directly into the interpreter). If there is none, it looks for a file module_name.py in
the directory containing the source code file. Then several other directories are taken into account. If the interpreter
does not find the requested module, an error message is shown.
Packages are collections of modules and can be imported in the same way as modules. Packages may contain sub-
packages. An import statement could look like this:
import package_name
package_name.subpackage_name.module_name.some_function()
The import statement creates a tree of objects and subobjects which reflects the structure of the package. To import
only one subpackage or one module from a package, use from:

from package_name import subpackage_name

subpackage_name.module_name.some_function()

or

from package_name.subpackage_name import module_name

module_name.some_function()
Python ships with a large number of modules and packages, known as the Python standard library. Have a look at the
complete list100 of the standard library’s contents and also at the Brief Tour of the Standard Library101 as well as the Brief
Tour of the Standard Library - Part II102 .
We already introduced some of the standard library’s modules and packages (datetime and os.path for instance)
and we will continue to introduce new functionality when needed for our purposes.
100 https://docs.python.org/3/library/index.html
101 https://docs.python.org/3/tutorial/stdlib.html
102 https://docs.python.org/3/tutorial/stdlib2.html
Writing our own module is very simple: put function definitions in a file my_module.py and import it with
import my_module. Writing modules makes code reusable and increases readability.
When importing a module the Python interpreter executes the module file. If you want to use a Python source code
file as a script as well as a module, you might check the value of the pre-defined variable __name__. If __name__
== '__main__', then the code is being executed as a script. If __name__ == 'module_name', then the
code is being run due to an import statement.
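The idiom could look as follows (the file name greet.py and its content are made up):

```python
# greet.py -- usable as a module (import greet) and as a script (python greet.py)
def greet(name):
    return 'Hello, {}!'.format(name)

if __name__ == '__main__':
    # this block runs only when the file is executed as a script,
    # not when it is imported
    print(greet('world'))
```

Importing greet from another file gives access to greet.greet without triggering the print.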
It’s also possible to use compiled modules103 .
Of course you can write your own packages. A package then is a directory package_name which contains the
Python files for all modules in the package and in addition a file __init__.py. This file might be empty, but is
required to mark a directory as Python package. Subpackages are subdirectories with __init__.py file.
For details see Python’s documentation104 .
Python does not support hidden functions or variables in modules. Hiding members of a class from the user of
the class is not possible either. But sometimes this would be quite useful: variables needed only for internal calculations
or little helper functions and methods shouldn’t be visible from outside, because if they were hidden we could
rename or remove them at will without breaking source code which uses the module or
class.
In Python there is a convention to mark private members: a leading underscore as in _hidden_by_convention.
Variables, functions and methods preceded by an underscore should not be accessed
or called from outside the module or class definition. But this is only a convention; nothing prevents you from violating
it.
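A small sketch of the convention (all names here are made up):

```python
# hypothetical module content illustrating the leading-underscore convention
_CM_PER_INCH = 2.54  # internal constant, not part of the public interface

def _scale(x):
    # helper function intended for internal use only
    return x * _CM_PER_INCH

def inches_to_cm(x):
    # public function of the (hypothetical) module
    return _scale(x)

print(inches_to_cm(10))  # approximately 25.4
```

Users of the module should call inches_to_cm only; _scale and _CM_PER_INCH may change or disappear in future versions.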
103 https://docs.python.org/3/tutorial/modules.html#compiled-python-files
104 https://docs.python.org/3/tutorial/modules.html#packages
THIRTEEN
Up to now we did not care about error handling. If something went wrong, the Python interpreter stopped execution
and printed some message. But Python provides techniques for more controlled error handling.
The Python interpreter parses the whole source code file before execution. In this phase the interpreter may encounter
syntax errors. That is, the interpreter does not understand what we want it to do; the code does not look the way
Python code should. Syntax errors are easily fixed by the programmer.
The more serious types of errors are runtime errors (or semantic errors) which occur during program execution.
Handling runtime errors is sometimes rather difficult.
The traditional way of handling runtime errors is to avoid them altogether. All user input and all other sources
of possible trouble get checked in advance by incorporating suitable if clauses into the code. This approach decreases
the readability of the code, because the important lines are hidden between lots of error checking routines.
The more Pythonic way of handling runtime errors is exceptions. Every time the interpreter encounters some problem,
like a division by zero, it raises an exception. The programmer may catch the exception and handle it appropriately,
or may leave exception handling to the Python interpreter. In the latter case, the interpreter usually
stops execution and prints a detailed error message.
try:
# code which may cause troubles
except ExceptionName:
# code for handling a certain exception caused by code in try block
except AnotherExceptionName:
# code for handling a certain exception caused by code in try block
else:
# code to execute after successfully finishing try block
The try block contains the code to be protected, that is, the code which might raise an exception. Then there is at
least one except block. The code in the except block is only executed, if the specified exception has been raised.
In this case, execution of the try block is stopped immediately and execution continues in the except block.
There can be several except blocks for handling different types of exceptions. Instead of an exception name also a
tuple of names can be given to handle several different exceptions in one block.
The else block is executed after successfully finishing the try block, that is, if no exception occurred. Here is
the right place for code which shall only be executed if no exception occurred, but for which no explicit exception
handling shall be implemented.
Here is an example:
a = 0  # some number from somewhere (e.g., user input)

try:
    b = 1 / a
except ZeroDivisionError:
    print('Division by zero. Setting result to 1000.')
    b = 1000  # set b to some (reasonable) value
else:
    print('Everything okay.')

print('Result is {}.'.format(b))
Without using exception handling the interpreter would stop execution at the division line. By catching the exception
we can avoid this automatic behavior and handle the problem in a way which does not prevent further program
execution.
Note that exception names are not strings, but names of object types (classes). Thus, don’t use quotation marks.
print(type(ZeroDivisionError))
print(dir(ZeroDivisionError))
<class 'type'>
['__cause__', '__class__', '__context__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', ...]
The Python documentation contains a list of built-in exceptions105 . There is a kind of tree structure in the set of all
exceptions and we may define new exceptions if we need them to express errors specific to our program. These topics
will be discussed in detail when delving deeper into object oriented programming.
13.1.4 Clean-Up
Sometimes it’s necessary to perform clean-up operations, like closing a file, no matter whether an exception occurred
or not. For this purpose Python provides the finally keyword:
try:
# code which may cause troubles
except ExceptionName:
# code for handling a certain exception caused by code in try block
else:
# code to execute after successfully finishing try block
finally:
# code for clean-up operations
105 https://docs.python.org/3/library/exceptions.html#concrete-exceptions
The finally block is executed after the try block (if there is no else block) or after the else block if no
exception occurred. If an exception occurred, then the finally block is executed after the corresponding except
clause. If the try or except clauses contain break, continue or return, then the finally block is executed
before break, continue or return takes effect. If a finally block executed before a return itself contains a
return, then finally’s return value is used and the original return value is ignored.
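The last rule can be checked with a small sketch:

```python
# a return in finally overrides the return in try
def which_return():
    try:
        return 'from try'
    finally:
        return 'from finally'

print(which_return())  # from finally
```

This behavior is one reason why returning from a finally block is usually considered bad style.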
Note: As long as a file is opened by our program the operating system may block file access for other programs. Thus,
we should close a file as soon as possible. Forgetting to close a file is not too bad, because the OS will close it for us
after program execution has stopped. But for long-running programs with only short file access at start-up, a non-closed
file may block access by other programs for hours or days. Thus, especially in case of exception handling,
make sure that in each situation (with or without exception) files get closed properly by the program.
Some object types, file objects for instance, include predefined clean-up actions. That is, for certain operations (e.g.,
opening a file) they define what should be done in a corresponding finally block (e.g., closing the file), if the
operations would be placed in a try block.
To use this feature Python has the with keyword:
with open('some_file') as f:
# do something with file object f
If the open function is successful, then the indented code block is executed. If open fails, an exception is raised.
In both cases with ensures that proper clean-up (closing the file) takes place.
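A small sketch (the file name is made up; tempfile provides a writable location):

```python
import os
import tempfile

# hypothetical file in the system's temp directory
path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')
with open(path, 'w') as f:
    f.write('hello')
print(f.closed)  # True: the with statement closed the file for us
os.remove(path)  # tidy up
```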
Objects which can be used with with are said to support the context management protocol. Such objects can also be
defined by the programmer using dunder methods, see Python’s documentation106 for details.
The purpose of with is to make code more readable by avoiding too many try...except...finally blocks.
Up to now we considered syntax errors, which basically are typos in the code, and semantic errors, which are caused
by unexpected user input or failed file access. But code may contain more involved semantic errors, which may be
hard to identify. The process of finding and correcting semantic errors is known as debugging.
A simple approach to debugging is to print status information during program flow. For private scripts and a data
scientist’s everyday use this suffices. For higher quality programs the Python standard library provides the logging
package, which allows redirecting some of the status information to a log file. Logging basics are described in the
basic logging tutorial107 .
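A minimal sketch following the basic logging tutorial; here the messages are directed into a string buffer so they can be inspected (the logger name 'demo' is arbitrary):

```python
import io
import logging

buf = io.StringIO()
logger = logging.getLogger('demo')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buf))  # send messages to the buffer

logger.info('program started')
logger.debug('not shown, because the level is INFO')
print(buf.getvalue().strip())  # program started
```

In a real program one would use logging.basicConfig to direct messages to the console or to a log file instead.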
If looking at log messages does not suffice, there are programs specialized in debugging your code. We do not cover
this topic here, but if you are interested you should have a look at The Python Debugger108 and at Debugging with
Spyder109 .
106 https://docs.python.org/3/library/stdtypes.html#context-manager-types
107 https://docs.python.org/3/howto/logging.html#basic-logging-tutorial
108 https://docs.python.org/3/library/pdb.html
109 https://docs.spyder-ide.org/debugging.html
13.3 Profiling
Sometimes our code does what we want it to do, but it is too slow or consumes too much memory (out-of-memory
error from the operating system). Then it’s time for profiling.
You may use the Spyder Profiler110 or import profiling functionality from suitable Python packages.
The timeit module provides tools for measuring a Python script’s execution time in seconds.
import timeit

a = 1.23
code = """\
b = 4.56 * a
"""

print(timeit.timeit(code, number=1000000, globals=globals()))

0.06846186006441712
This code snippet packs some code into the string code and passes it to the timeit function. This function
executes the code number times to increase accuracy. The built-in function globals returns a dictionary of all
defined names. This dictionary should be passed to the timeit function to provide access to all names.
Have a look at the The Python Profilers111 , too.
Note: If working in Jupyter you may use the %timeit and %%timeit112 magics instead of the timeit module,
the former for timing one line of code (%timeit one_line_of_code), the latter for timing the whole code
cell (place it in the cell’s first line).
From data science view also memory consumption is of interest, because handling large data sets requires lots of
memory. There are many ways to obtain memory information. A simple one is as follows (install module pympler
first):
from pympler import asizeof

# my_string and my_int: some previously defined objects
print(asizeof.asizeof(my_string))
print(asizeof.asizeof(my_int))
72
32
This gives the size of the memory allocated for some object. This number also includes the size of ‘subobjects’, that
is, for example, all the objects referenced by a list object are included.
110 https://docs.spyder-ide.org/profiler.html
111 https://docs.python.org/3/library/profile.html
112 https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit
FOURTEEN
INHERITANCE
Inheritance is an important principle in object-oriented programming. Although we may reach our aims without using
inheritance in our own code, it’s important to know the concept and corresponding syntax constructs to understand
other people’s code. We’ll meet inheritance-related code
• in the documentation of modules and packages,
• when customizing and extending library code.
Related exercises: Object-Oriented Programming (page 267).
Inheritance is a technique to create new classes by extending and/or modifying existing ones. A new class may have
a base class. The new class inherits all methods and member variables from its base class and is allowed to replace
some of the methods and to add new ones. Syntax:
class NewClass(BaseClass):
The only difference compared to usual class definitions is in the first line, where a base class can be specified. Defining
methods works as before. If the method name does not exist in the base class, then a new method is created. If it
already exists in the base class, the new one is used instead of the base class’ method. In addition to explicitly defined
methods, the new class inherits all methods from the base class.
Inheritance saves implementation time and leads to a well-structured class hierarchy. Object-oriented program-
ming is not solely about defining classes (encapsulation and abstraction), but also about defining meaningful relations
between classes, thus, to some extent, mapping the real world to source code.
14.3 Example
Real-life examples of inheritance often are quite involved. For illustration we use a pathological example resembling
relations between geometric objects.
Imagine a vector drawing program. Each geometric object shall be represented as object of a corresponding class.
Say quadrangles are objects of type Quad, paraxial rectangles are objects of type ParRect and so on. Let’s start
with class Point:
class Point:
    ''' represent a geometric point in two dimensions '''
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __str__(self):
        return f'({self.x}, {self.y})'
class Quad:
    ''' represent a quadrangle '''
    def __init__(self, a, b, c, d):
        self._a = a
        self._b = b
        self._c = c
        self._d = d
    def get_points(self):
        return (self._a, self._b, self._c, self._d)
    def __str__(self):
        return f'quadrangle with points ' \
               f'({self._a.x}, {self._a.y}), ({self._b.x}, {self._b.y}), ' \
               f'({self._c.x}, {self._c.y}), ({self._d.x}, {self._d.y})'
The member variables _a, _b, _c, _d are hidden since we consider them implementation details. If the user wants
access to the four points making up the quadrangle, get_points should be called. This way we are free to store the
quadrangle in a different format if that seems reasonable in the future when extending the class’s functionality. This is
a design decision and is in no way related to inheritance.
Here comes ParRect: Note that a paraxial rectangle is defined by two Points.
class ParRect(Quad):
    ''' represent a paraxial rectangle '''
    def __init__(self, a, c):
        # build the two missing corners and call Quad's constructor via super()
        super().__init__(a, Point(c.x, a.y), c, Point(a.x, c.y))
    def __str__(self):
        return f'paraxial rect with points ({self._a.x}, {self._a.y}), ' \
               f'({self._c.x}, {self._c.y})'
    def area(self):
        ''' return the rect's area '''
        return abs(self._b.x - self._a.x) * abs(self._d.y - self._a.y)
The ParRect class inherits everything from Quad. It has a new constructor with fewer arguments than in Quad,
but calls the constructor of Quad.
Important: The built-in function super in principle returns self (that is, the current object), but redirects method
calls to the base class.
We reimplement __str__ and add the new method area. Note that ParRect objects have member variables
_a, _b, _c, _d since those are created by the Quad constructor we call in the ParRect constructor. Also the
get_points method is a member of ParRect since it gets inherited from Quad.
parrect = ParRect(Point(1, 1), Point(3, 2))  # two opposite corners (illustrative values)

print(parrect)
print('area: {}'.format(parrect.area()))

a, b, c, d = parrect.get_points()
print('all points:', a, b, c, d)
Note that isinstance also returns True if we check against a base class of an object’s class. In other words, each
object is an instance of its class and of all base classes.
print(isinstance(parrect, ParRect))
print(isinstance(parrect, Quad))
True
True
In contrast, type checking with type returns False if checked against the base class:
print(type(parrect) == ParRect)
print(type(parrect) == Quad)
True
False
In Python there is a built-in class object and every newly created class automatically becomes a subclass of ob-
ject. The line
class my_new_class:
is equivalent to
class my_new_class(object):
and also to
class my_new_class():
by the way.
To see this in code we might use the built-in function issubclass. This function returns True if the first argument
is a subclass of the second.
class MyClass:
def __init__(self):
print('Here is __init__()!')
print(issubclass(MyClass, object))
True
Alternatively, we may have a look at the __base__ member variable, which stores the base class:
print(MyClass.__base__)
<class 'object'>
Objects of type object do not have real functionality. The object class provides some auxiliary stuff used by
the Python interpreter for managing classes and objects.
obj = object()
dir(obj)
['__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__ne__',
 '__new__',
 ...]
Python does not allow for directly implementing so-called virtual methods. A virtual method is a method in a base class
which has to be (re-)implemented by each subclass. The typical situation is as follows: the base class implements
some functionality which for some reason has to call a method of a subclass. How can we guarantee that the subclass
provides the required method?
In Python a virtual method is a usual method which raises a NotImplementedError, a special exception type
like ZeroDivisionError and so on. If everything is correct, this never happens, because the subclass overrides
the base class’ method. But if the creator of the subclass forgets to implement the method required by the base class,
an error message will be shown.
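A minimal sketch of this pattern (the class and method names are made up):

```python
# a 'virtual' method realized via NotImplementedError
class Base:
    def describe(self):
        # base class functionality relying on a method the subclass must provide
        return 'value: {}'.format(self.value())
    def value(self):
        raise NotImplementedError('subclasses must implement value()')

class Concrete(Base):
    def value(self):
        return 42

print(Concrete().describe())  # value: 42
```

Calling Base().describe() directly would raise NotImplementedError, immediately pointing the subclass author to the missing method.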
A class may have several base classes. Just provide a tuple of base classes in the class definition:

class NewClass(BaseClass1, BaseClass2):
    # method definitions as usual

The new class inherits everything from all its base classes.
If two base classes provide methods with identical names, the Python interpreter has to decide which one to use for
the new class. There is a well-defined algorithm for this decision. If you need this knowledge someday, watch out for
method resolution order (MRO).
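A small sketch of the question the MRO answers (the class names are made up):

```python
# which 'who' does C inherit when both base classes define one?
class A:
    def who(self):
        return 'A'

class B:
    def who(self):
        return 'B'

class C(A, B):
    pass

print(C().who())  # A -- the leftmost base class wins in this simple case
```

The full resolution order can be inspected via C.__mro__.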
Up to now we used built-in exceptions only, like ZeroDivisionError. But now we have gathered enough
knowledge to define new exceptions. Exceptions are classes, as we noted before. Each exception is a direct or indirect
subclass of BaseException. Almost all exceptions are also a subclass of Exception, which itself is a direct
subclass of BaseException. See Exception hierarchy113 for the exceptions’ genealogy.
If we want to introduce a new exception, we have to create a new subclass of Exception.
class SomeError(Exception):
    def __init__(self, message):
        self.message = message

def my_function():
    print('I do something...')
    raise SomeError('Meaty error message!!!')

try:
    print('Entering my_function...')
    my_function()
except SomeError as e:
    print('Exception SomeError:', e.message)
113 https://docs.python.org/3/library/exceptions.html#exception-hierarchy
Entering my_function...
I do something...
Exception SomeError: Meaty error message!!!
At first we define a new exception class SomeError. The constructor takes an error message and stores it in the
member variable message. The function my_function raises SomeError. The main program catches this
exception and prints the error message. The as keyword provides access to a concrete SomeError object containing
the error message.
Note that except SomeBaseClass also catches all subclasses of SomeBaseClass. If we want to handle a
subclass exception separately, we have to place its except line above the base class’s except line. Conversely, an
except for a subclass never handles a base class exception.
FIFTEEN
Python provides many more features than we need. Here we list some of them which might be of interest, either
because they simplify some coding tasks or because they frequently occur in other people’s code.
Python does not allow empty functions, classes, loops and so on. But there is a ‘do nothing’ command, the pass114
keyword.
def do_nothing():
pass
do_nothing()
Especially for debugging purposes, placing multiple if statements can be avoided by using assert115 instead. The
condition following assert is evaluated. If the result is False, an AssertionError is raised, else nothing
happens. An optional error message is possible, too.
a = 0  # some number from somewhere (e.g., user input)

assert a != 0, 'Zero is not allowed!'

b = 1 / a
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 a = 0  # some number from somewhere (e.g., user input)
----> 3 assert a != 0, 'Zero is not allowed!'
      5 b = 1 / a
AssertionError: Zero is not allowed!
114 https://docs.python.org/3/reference/simple_stmts.html#the-pass-statement
115 https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement
Python 3.10 introduced two new keywords: match and case. In the simplest case they can be used to replace
if...elif...elif...else constructs for discriminating between many cases. But in addition they provide a powerful
pattern matching mechanism. See PEP 636 - Structural Pattern Matching: Tutorial116 for an introduction.
Python knows a data type for representing (mathematical) sets. Its name is set. Typical operations like unions and
intersections of sets are supported. The official Python Tutorial117 shows basic usage.
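A small sketch of basic set operations (the values are illustrative):

```python
a = {1, 2, 3}
b = {3, 4}
print(a | b)   # union
print(a & b)   # intersection
print(2 in a)  # membership test
```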
Function decorators are syntactic sugar (that is, not really needed), but very common. They precede a function
definition and consist of an @ character and a function name. Function decorators are used to modify a function by
applying another function to it. The following two code cells are more or less equivalent:
def double(func):
return lambda x: 2 * func(x)
@double
def calc_something(x):
return x * x
def double(func):
return lambda x: 2 * func(x)
def calc_something(x):
return x * x
calc_something = double(calc_something)
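Calling the decorated function shows the effect; this sketch repeats the definitions from above to be self-contained:

```python
def double(func):
    # wrap func so that its result is doubled
    return lambda x: 2 * func(x)

@double
def calc_something(x):
    return x * x

print(calc_something(3))  # 18, i.e. 2 * (3 * 3)
```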
The copy module119 provides functions for shallow and deep copying of objects. For discussion of the copy problem
in the context of lists see Multiple Names and Copies (page 96).
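A small sketch contrasting shallow and deep copies (the data is illustrative):

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)      # copies the outer list only
deep = copy.deepcopy(a)     # copies inner lists, too

a[0][0] = 99
print(shallow[0][0])  # 99 -- inner lists are shared with a
print(deep[0][0])     # 1  -- fully independent copy
```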
116 https://peps.python.org/pep-0636/
117 https://docs.python.org/3/tutorial/datastructures.html#sets
118 https://docs.python.org/3/reference/compound_stmts.html#function-definitions
119 https://docs.python.org/3/library/copy.html
15.7 Multitasking
Reading large data files or downloading data from some server may take a while. During this time the CPU is
more or less idle and could do some heavy computations without slowing down the data transfer. This is a typical situation
where one wants two Python programs, or two parts of one and the same program, running in parallel, possibly
communicating with each other.
Python has the threading120 module for real multitasking on operating system level. The asyncio121 module in
combination with the async and await keywords provides a simpler, cooperative multitasking approach completely controlled
by the Python interpreter.
The Python ecosystem provides lots of packages for creating and controlling graphical user interfaces (GUIs). Here
are some widely used ones:
• tkinter122 in Python’s standard library provides support for classical desktop applications with a main win-
dow, subwindows, buttons, text fields, and so on.
• ipywidgets123 is a very easy to use package for creating graphical user interfaces in Jupyter notebooks. For
instance a slider widget could control some parameter of an algorithm.
• flask124 is a package for building web apps with Python, that is, the user interacts via a website with the
Python program running on a server.
120 https://docs.python.org/3/library/threading.html
121 https://docs.python.org/3/library/asyncio.html
122 https://docs.python.org/3/library/tkinter.html
123 https://ipywidgets.readthedocs.io
124 https://flask.palletsprojects.com
CHAPTER
SIXTEEN
Almost all data comes as tables with lots of numbers. NumPy125 is a Python package for efficiently handling large
tables of numbers. NumPy also provides advanced and very efficient linear algebra operations on such tables, for
instance solving systems of linear equations based on the data. Most machine learning algorithms boil down to
moving large amounts of data and doing some linear algebra. Thus, it’s a good idea to spend some time understanding
NumPy’s basic principles and discovering NumPy’s functionality.
• NumPy Arrays (page 155)
• Array Operations (page 161)
• Advanced Indexing (page 165)
• Vectorization (page 166)
• Array Manipulation Functions (page 168)
• Copies and Views (page 171)
• Efficiency Considerations (page 173)
• Special Floats (page 175)
• Linear Algebra Functions (page 177)
• Random Numbers (page 179)
Related exercises:
• NumPy Basics (page 269)
• Image Processing with NumPy (page 271)
import numpy as np
125 https://numpy.org
From mathematics we know Vectors (page 333) and Matrices (page 334). A vector is a (one-dimensional) list of
numbers. A matrix is a (two-dimensional) field of numbers. Vectors could be represented by lists in Python, whereas
a matrix would be a list of lists (a list of rows or a list of columns).
Using Python lists for representing large vectors and matrices is very inefficient. Each item of a Python list has its
own location somewhere in memory. When reading a whole list, to multiply a vector by some number, for instance,
Python reads the first list item, then looks for the memory location of the second, then reads the second, and so on.
A lot of memory management is involved.
To significantly improve performance, NumPy provides the ndarray data type. The most important property of
an ndarray is its dimension. A one-dimensional array stores a vector. A two-dimensional array stores a matrix.
Zero-dimensional arrays hold a single number and are valid Python objects, too. Visualization of arrays with dimension above
two is somewhat difficult. A three-dimensional array can be visualized as a cuboid of numbers, each number described
by three indices (row, column, depth level). We will meet dimensions of three and above almost every day when
diving into machine learning. One example are color images: two dimensions for pixel positions, one dimension for
color channels (red, green, blue, transparency).
Why are NumPy arrays more efficient?
• All items of a NumPy array have to have identical data type, mostly float or integer. This saves time and
memory for handling different types and type conversions.
• All items of a NumPy array are stored in a well-structured contiguous block of memory. To find the next
item or to copy a whole array or part of it much less memory management operations are required.
• NumPy provides optimized mathematical operations for vectors and matrices. Instead of processing arrays
item by item, NumPy functions take the whole array and process it in compiled C code. Thus, the item-by-item
part is not done by the (slow) Python interpreter, but by (very fast) compiled code.
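The speed difference can be made visible with a rough timing sketch (the array size is illustrative; absolute numbers depend on the machine):

```python
import time
import numpy as np

n = 1_000_000
data = list(range(n))
arr = np.arange(n)

t0 = time.perf_counter()
doubled_list = [2 * x for x in data]  # item-by-item in the interpreter
t1 = time.perf_counter()
doubled_arr = 2 * arr                 # one call into compiled code
t2 = time.perf_counter()

print('list comprehension: {:.4f} s'.format(t1 - t0))
print('NumPy:              {:.4f} s'.format(t2 - t1))
```

On typical machines the NumPy variant is one to two orders of magnitude faster.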
There are several ways to create NumPy arrays. We start with conversion of Python lists or tuples by NumPy’s
array function.
Passing a list or a tuple to array yields a one-dimensional ndarray. The data type is determined by NumPy to
be the simplest type which can hold all objects in the list or tuple.
a = np.array([23, 42, 7, 4, -2])

print(a)
print(a.dtype)

[23 42 7 4 -2]
int64
The member variable ndarray.dtype contains the array’s data type. Here NumPy decided to use int64, that
is, integers of length 8 bytes. Available types will be discussed below. An example with floats:

b = np.array([2.5, -7.0, 3.1, 0.5, 1.2])  # some floats (illustrative values)

print(b)
print(b.dtype)
Important: NumPy ships with its own data types for numbers to allow for more efficient storage and computations.
Python’s int type allows for arbitrarily large numbers, whereas NumPy has different types for integers with different
(and finite!) numerical ranges. NumPy also knows several types of floats differing in precision (number of decimal
places) and range. Wherever possible conversion between Python types and NumPy types is done automatically.
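The finite numerical ranges can be observed directly (a sketch; int8 is NumPy's smallest integer type):

```python
import numpy as np

# 8-bit integers cover -128..127 only; arithmetic wraps around
a = np.array([127], dtype=np.int8)
print(a + 1)  # [-128]
```

Such silent overflows are a common source of bugs when choosing small integer types to save memory.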
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(c)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
Next to explicit creation, NumPy arrays may be the result of mathematical operations:
d = a + b
print(type(d))
<class 'numpy.ndarray'>
To see that d is indeed a new ndarray and not an in-place modified a or b, we might look at the object ids, which
are all different:

print(id(a), id(b), id(d))
A third way for creating NumPy arrays is to call specific NumPy functions returning new arrays. From np.zeros
we get an array of zeros. From np.ones we get an array of ones. There are much more functions like zeros and
ones, see Array creation routines126 in Numpy’s documentation.
a = np.zeros(5)
b = np.ones((2, 3))
print(a, '\n')
print(b)
[0. 0. 0. 0. 0.]
[[1. 1. 1.]
[1. 1. 1.]]
126 https://numpy.org/doc/stable/reference/routines.array-creation.html#routines-array-creation
Objects of type ndarray have several member variables containing important information about the array:
• ndim: number of dimensions,
• shape: tuple of length ndim with array size in each dimension,
• size: total number of elements,
• nbytes: number of bytes occupied by the array elements,
• dtype: the array’s data type.
a = np.zeros((4, 3))
print(a.ndim)
print(a.shape)
print(a.size) # 4 * 3
print(a.nbytes) # 4 * 3 * 8
print(a.dtype)
2
(4, 3)
12
96
float64
It’s important to know that shape matters. In mathematics almost always we identify vectors with matrices having
only one column. But in NumPy these are two different things. A vector has shape (n, ), that is ndim is 1, whereas
a one-column matrix has shape (n, 1) with ndim of 2. Consequently, a vector neither is a row nor a column in
NumPy. It’s simply a list of numbers, nothing more.
a = np.zeros(5)
b = np.zeros((5, 1))
c = np.zeros((1, 5))
print(a, '\n')
print(b, '\n')
print(c)
[0. 0. 0. 0. 0.]
[[0.]
[0.]
[0.]
[0.]
[0.]]
[[0. 0. 0. 0. 0.]]
Elements of NumPy arrays can be accessed similarly to items of Python lists. That is, the first item in a one-
dimensional ndarray has index 0 and the last one has index ndarray.size - 1. Slicing is allowed, too.
a = np.array([23, 42, -7, 4, 10])
print(a[0], '\n')
print(a[1:3], '\n')
print(a[::2])
23
[42 -7]
[23 -7 10]
In case of multi-dimensional arrays we have to provide an index for each dimension. Slicing is done per dimension.
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(a, '\n')
print(a[1:, :2], '\n')
print(a[::2, ::2], '\n')
print(a[1, :])
[[1 2 3]
[4 5 6]
[7 8 9]]
[[4 5]
[7 8]]
[[1 3]
[7 9]]
[4 5 6]
Note: Selecting all elements in the last dimensions like in a[1, :] can be abbreviated to a[1]. The same holds
for higher dimensions: a[1, 3, :, :] is equivalent to a[1, 3]. The drawback is that the array's dimensionality
is not immediately visible from the indexing expression.
NumPy knows many different numerical data types. Often we do not have to care about types (NumPy will choose
suitable ones), but sometimes we have to specify data types explicitly (see examples below).
Almost all NumPy functions accept the keyword argument dtype to specify the data type of the function’s return
value. Either pass a string with the desired type’s name or pass a type object. Passing Python types like int makes
NumPy choose the most appropriate NumPy type (here, np.int64 or the string 'int64').
a = np.zeros((2, 3))
b = np.zeros((2, 3), dtype=np.int64)
print(a, '\n')
print(b)
[[0. 0. 0.]
[0. 0. 0.]]
[[0 0 0]
[0 0 0]]
Hint: The dtype member of ndarrays and the dtype argument to NumPy functions carry more information
than the bare type (e.g., ‘signed integer of length 64 bits’). They also contain information about how data is organized
in memory. This is important for efficient import of data from external sources. Details will be discussed in Saving
and Loading Non-Standard Data (page 181).
Working with large NumPy arrays we have to save memory wherever possible. One important ingredient for memory
efficiency is choosing small types, that is, types with small range. Often we work with arrays of zeros and ones or of
small integers only. Then we should choose the smallest integer type:
a = np.zeros(1000, dtype=np.int8)
print(a.dtype, a.nbytes)
int8 1000
For a data set with one billion numbers, choosing the right type makes the difference between requiring 1 GB or
8 GB of memory!
127 https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types
Creating an array without explicitly providing a data type makes NumPy choose np.int64 or np.float64
depending on the presence of floats. This may lead to hard-to-find errors:
a = np.array([1, 4, 6, 7])    # no floats, so NumPy chooses int64
a[3] = 0.3                    # float is silently truncated
print(a)
[1 4 6 0]
Modifying values in integer arrays converts the new values to the array's data type, even if information is lost.
To avoid such errors, always specify the data type explicitly when working with floats!
import numpy as np
All Python operators can be applied to NumPy arrays, where all operations work elementwise.
For instance, we can easily add two vectors or two matrices by using the + operator.
a = np.array([1, 2, 3])
b = np.array([9, 8, 7])
c = a + b
print(c)
[10 10 10]
Important: Because all operations work elementwise, the * operator on two-dimensional arrays does NOT multiply
the two matrices in the mathematical sense, but simply yields the matrix of elementwise products. Mathematical
matrix multiplication will be discussed in Linear Algebra Functions (page 177).
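A minimal sketch contrasting the two operations (matrices chosen for illustration):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# elementwise product
print(A * B)    # [[ 5 12], [21 32]]

# mathematical matrix product
print(A @ B)    # [[19 22], [43 50]]
```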
NumPy reimplements almost all mathematical functions, like sine, cosine and so on. NumPy’s functions take arrays as
arguments and apply the mathematical functions elementwise. Have a look at Mathematical functions128 in NumPy’s
documentation for a list of available functions.
a = np.array([1, 2, 3])
b = np.sin(a)
print(b)
[0.84147098 0.90929743 0.14112001]
128 https://numpy.org/doc/stable/reference/routines.math.html
Hint: Functions min and amin are equivalent. The amin variant exists to avoid confusion and name conflicts with
Python’s built-in function min. Writing np.min is okay.
Important: Functions np.min and np.minimum do different things. With min we get the smallest value in an
array, whereas minimum yields the elementwise minimum of two equally sized arrays. With np.argmin we get
the index (not the value) of the minimal element of an array.
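A small sketch of the three functions with freely chosen values:

```python
import numpy as np

a = np.array([4, 1, 7])
b = np.array([2, 3, 5])

print(np.min(a))         # 1: smallest value in a
print(np.minimum(a, b))  # [2 1 5]: elementwise minimum of a and b
print(np.argmin(a))      # 1: index of the smallest value in a
```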
a = np.array([1, 2, 3])
b = np.array([-1, 3, 3])
print(a > b)
[ True False False]
Comparisons result in NumPy arrays of data type bool. The function np.any returns True if and only if at least
one item of the argument is True. The function np.all returns True if and only if all items of the argument are
True.
a = np.array([True, False, False])
print(np.any(a))
print(np.all(a))
True
False
Some of NumPy’s functions also are accessible as methods of ndarray objects. Examples are any and all:
print(a.any())
print(a.all())
True
False
Due to Python’s internal workings for processing logical expressions efficiently it’s not possible to redefine and
and friends via dunder methods. Thus, there’s no chance for NumPy to define it’s own variant of and. There is
a dunder method __and__, but that implements bitwise ‘and’ (Python operator &), which is something different
than logical ‘and’.
To combine several conditions you might use logical_and129 and friends. Using Python’s and (and friends)
results in an error, because Python tries to convert a NumPy array to a bool value and it’s not clear how to do this
(any or all?).
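A short sketch of combining conditions (array values chosen for illustration):

```python
import numpy as np

a = np.array([1, 4, 2, 7, 5])

# combine two conditions elementwise; Python's `and` would raise here
mask = np.logical_and(a > 1, a < 6)
print(mask)    # [False  True  True False  True]

# the & operator works too, but needs parentheses due to operator precedence
mask2 = (a > 1) & (a < 6)
print(mask2)
```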
129 https://numpy.org/doc/stable/reference/generated/numpy.logical_and.html
16.2.3 Broadcasting
If dimensions of the operands of a binary operation do not fit (short vector plus long vector, for instance) an exception
is raised. But in some cases NumPy uses a technique called broadcasting to make dimensions fit by cloning suitable
subarrays. Examples:
a = np.array([[1, 2, 3]]) # 1 x 3
b = np.ones((4, 3)) # 4 x 3
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]]
(4, 3)
a = np.array([[1, 2, 3]]) # 1 x 3
b = np.array([[1], [2], [3], [4]]) # 4 x 1
c = a + b
print(a, '\n')
print(b, '\n')
print(c, '\n')
print(c.shape)
[[1 2 3]]
[[1]
[2]
[3]
[4]]
[[2 3 4]
[3 4 5]
[4 5 6]
[5 6 7]]
(4, 3)
a = np.array([1, 2, 3])    # shape (3,)
b = np.ones((4, 3))        # 4 x 3
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[1 2 3]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]]
(4, 3)
a = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2 x 3
b = 7                        # scalar
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]
[4 5 6]]
7
[[ 8 9 10]
[11 12 13]]
(2, 3)
On the other hand broadcasting allows for efficient column or row operations:
a = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 x 3
b = np.array([[0.5], [2.0]])   # 2 x 1
c = a * b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]
[4 5 6]]
[[0.5]
[2. ]]
[[ 0.5 1. 1.5]
[ 8. 10. 12. ]]
(2, 3)
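If shapes cannot be broadcast in either direction, NumPy raises a ValueError; a quick sketch:

```python
import numpy as np

a = np.zeros(3)
b = np.zeros(4)

try:
    a + b    # shapes (3,) and (4,) are incompatible
except ValueError as e:
    print('broadcasting failed:', e)
```

The general broadcasting rules are described in Broadcasting130 in NumPy's documentation.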
NumPy supports different indexing techniques for accessing subarrays. We already discussed list-like indexing. Now
we add boolean and integer indexing.
import numpy as np
If the index to an array is a boolean array of the same shape as the indexed array, then a one-dimensional array of all
items where the index is True is returned.
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
idx = np.array([[True, True, False],
                [False, True, True],
                [True, False, False]])
b = a[idx]
print(a, '\n')
print(idx, '\n')
print(b)
[[1 2 3]
[4 5 6]
[7 8 9]]
[[ True True False]
[False True True]
[ True False False]]
[1 2 5 6 7]
a = np.array([1, 4, 3, 5, 7, 6, 3, 2, 4, 5, 6, 7, 4, 1, 9])
b = a[a > 3]
print(b)
[4 5 7 6 4 5 6 7 4 9]
130 https://numpy.org/doc/stable/user/basics.broadcasting.html
Here b is an array containing all numbers of a which are greater than 3. The comparison a > 3 returns a boolean
array of the same shape as a (note that broadcasting is used to compare an array to a number). The resulting boolean
array then is used as index to a.
Given an array we may provide an array of indices. The result of the corresponding indexing operation is an array of
the same size as the index array, but with the items of the indexed array at corresponding positions.
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([[0, 5], [2, 3]])
print(a[idx])
[[1 6]
[3 4]]
For indexing multi-dimensional arrays we need multiple index arrays (one per dimension).
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
idx0 = np.array([[0, 0], [2, 2]])    # row indices
idx1 = np.array([[0, 2], [1, 0]])    # column indices
print(a[idx0, idx1])
[[1 3]
[8 7]]
16.4 Vectorization
NumPy is very fast if operations are applied to whole arrays instead of element-by-element. Thus, we should try to
avoid iterating through array elements and processing single elements. This idea is known as vectorization.
import numpy as np
Imagine we have a vector of length 𝑛, where 𝑛 is even. We would like to interchange each number at an even index
with its successor. The result shall be stored in a new array.
Here is the code based on a loop:
def interchange_loop(a):
    result = np.empty_like(a)
    for k in range(a.size // 2):
        result[2 * k] = a[2 * k + 1]
        result[2 * k + 1] = a[2 * k]
    return result
print(interchange_loop(np.array([1, 2, 3, 4, 5, 6, 7, 8])))
[2 1 4 3 6 5 8 7]
And here the vectorized version:
def interchange_vectorized(a):
    result = np.empty_like(a)
    result[0::2] = a[1::2]
    result[1::2] = a[0::2]
    return result
print(interchange_vectorized(np.array([1, 2, 3, 4, 5, 6, 7, 8])))
[2 1 4 3 6 5 8 7]
%%timeit
interchange_loop(np.zeros(1000))
178 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
interchange_vectorized(np.zeros(1000))
2.96 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The speed-up is a factor of about 60 for 𝑛 = 1000 and becomes even larger as 𝑛 grows.
Given two lists of numbers we want to have a two-dimensional array containing all products from these numbers.
Here is the code based on a loop:
def products_loop(x, y):
    result = np.empty((len(x), len(y)), dtype=np.int64)
    for i, p in enumerate(x):
        for j, q in enumerate(y):
            result[i, j] = p * q
    return result

print(products_loop(range(1000), range(1000)))
And here the vectorized version, which multiplies a column matrix by a row matrix via broadcasting:
def products_vectorized(x, y):
    a = np.array([x])    # 1 x n row matrix
    b = np.array([y])    # 1 x n row matrix
    return a.T * b       # (n x 1) * (1 x n) yields n x n

print(products_vectorized(range(1000), range(1000)))
[[ 0 0 0 ... 0 0 0]
[ 0 1 2 ... 997 998 999]
[ 0 2 4 ... 1994 1996 1998]
...
[ 0 997 1994 ... 994009 995006 996003]
[ 0 998 1996 ... 995006 996004 997002]
[ 0 999 1998 ... 996003 997002 998001]]
Hint: The T member variable of a NumPy array provides the transposed array. It's not a copy (expensive) but a
view (cheap). For details see Linear Algebra Functions (page 177).
Execution times:
%%timeit
products_loop(range(1000), range(1000))
248 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
products_vectorized(range(1000), range(1000))
1.25 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
16.4.3 Important
Whenever you see a loop in numerical routines, spend some time vectorizing it. Almost always this is possible.
Often vectorization also increases the readability of the code.
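A typical pattern is replacing an if inside a loop by a boolean mask; a small sketch with freely chosen values:

```python
import numpy as np

a = np.array([-2, 3, -1, 4])

# loop version: set negative entries to zero
b = a.copy()
for k in range(b.size):
    if b[k] < 0:
        b[k] = 0

# vectorized version using a boolean mask
c = a.copy()
c[c < 0] = 0

print(b, c)    # [0 3 0 4] [0 3 0 4]
```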
NumPy comes with lots of functions for manipulating arrays. Some of them are needed more often, others almost
never. A comprehensive list is provided in array manipulation routines131 . Here we only mention some of the more
important ones.
import numpy as np
131 https://numpy.org/doc/stable/reference/routines.array-manipulation.html
A NumPy array’s reshape132 method yields an array of different shape, but with identical data. The new array has
to have the same number of elements as the old one.
a = np.ones(5) # 1d (vector)
b = a.reshape(1, 5) # 2d (row matrix)
c = a.reshape(5, 1) # 2d (column matrix)
print(a, '\n')
print(b, '\n')
print(c)
[1. 1. 1. 1. 1.]
[[1. 1. 1. 1. 1.]]
[[1.]
[1.]
[1.]
[1.]
[1.]]
One dimension may be replaced by -1 indicating that the size of this dimension shall be computed by NumPy:
a = np.ones((8, 8))
b = a.reshape(4, -1)
print(a.shape, b.shape)
(8, 8) (4, 16)
To mirror a 2d array on its vertical or horizontal axis use fliplr133 and flipud134 , respectively.
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.fliplr(a)
print(a, '\n')
print(b)
print(b)
[[1 2 3]
[4 5 6]]
[[3 2 1]
[6 5 4]]
132 https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
133 https://numpy.org/doc/stable/reference/generated/numpy.fliplr.html
134 https://numpy.org/doc/stable/reference/generated/numpy.flipud.html
Arrays of identical shape (except for one axis) may be joined along an existing axis into one large array with
concatenate135 .
a = np.ones((2, 3))
b = np.zeros((2, 5))
c = np.full((2, 2), 5)
d = np.concatenate((a, b, c), axis=1)
print(d)
[[1. 1. 1. 0. 0. 0. 0. 0. 5. 5.]
[1. 1. 1. 0. 0. 0. 0. 0. 5. 5.]]
If identically shaped arrays shall be joined along a new axis, use stack136 .
a = np.ones(2)
b = np.zeros(2)
c = np.full(2, 5)
d = np.stack((a, b, c), axis=1)
print(d)
[[1. 0. 5.]
[1. 0. 5.]]
Like Python lists, NumPy arrays may be extended by appending further data. The append137 function takes the
original array and the new data and returns a new, extended array (the original array is not modified).
a = np.ones((3, 3))
b = np.append(a, [[1, 2, 3]], axis=0)
print(b)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 2. 3.]]
135 https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
136 https://numpy.org/doc/stable/reference/generated/numpy.stack.html
137 https://numpy.org/doc/stable/reference/generated/numpy.append.html
NumPy arrays may be very large. Thus, having too many copies of one and the same array (or of subarrays) is expensive.
NumPy implements a mechanism to avoid unnecessary copies by sharing data between arrays. The
programmer has to keep track of which arrays share data and which arrays are independent of others.
import numpy as np
16.6.1 Views
A view of a NumPy array is a usual ndarray object, that shares data with another array. The other array is called
the base array of the view.
Views can be created with an array’s view method. The base object is accessible through a view’s base member
variable.
a = np.ones((100, 100))
b = a.view()
print('id of a:', id(a))
print('id of b:', id(b))
print('base of a:', a.base)
print('id of base of b:', id(b.base))
id of a: 139825110801392
id of b: 139825110801776
base of a: None
id of base of b: 139825110801392
The view method is rarely called directly (it might be used for type conversions without copying). More frequently,
views originate from calling shape manipulation functions like reshape or fliplr:
b = a.reshape(10, 1000)
c = np.fliplr(a)
print(a.shape)
print(b.shape, b.base is a)
print(c.shape, c.base is a)
(100, 100)
(10, 1000) True
(100, 100) True
Operations on views alter the base array's (and other views') data:
b[0, 0] = 5
print(a[0, 0], c[0, 99])
5.0 5.0
Important: Writing data to views modifies the base array! This is a common source of errors, which are very hard
to track down. Always keep track of which of your arrays are views!
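To find out whether two arrays share data, np.shares_memory can be used; a small sketch:

```python
import numpy as np

a = np.ones((4, 4))
b = a.reshape(2, 8)    # view of a
c = a.copy()           # independent copy

print(np.shares_memory(a, b))    # True
print(np.shares_memory(a, c))    # False
```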
Views may be smaller than the original array. Such views of subarrays originate from slicing operations:
a = np.ones((100, 100))
b = a[4:10, :]
c = a[5]
print(a.shape)
print(b.shape, b.base is a)
print(c.shape, c.base is a)
(100, 100)
(6, 100) True
(100,) True
b[1, 0] = 5
print(a[5, 0])
5.0
16.6.3 Copies
Truly independent arrays can be created with an array's copy method. A copy does not share data with the original
array; its base is None:
a = np.ones((100, 100))
b = a.copy()
b[0, 0] = 5
print(a[0, 0], b[0, 0])
print(b.base)
1.0 5.0
None
Hint: NumPy arrays are mutable objects. Thus, assigning a new name to an array or passing an array to a function
does not copy the array. Keeping this in mind is very important because functions you call in your code may alter
your arrays. Conversely, when writing functions other people might use, clearly indicate in the documentation
whether your function modifies arrays passed as arguments. If in doubt, use copy.
When working with large arrays the most expensive operation (longest execution time) is copying arrays. Thus, we
should avoid making copies of arrays. But there are some more things to consider when optimizing execution time
and memory consumption.
import numpy as np
Sometimes data arrives in chunks and we have to build a large array step by step. We could start with an empty
array and append each new chunk of data. If incoming chunks are single numbers, the code could look as follows:
%%timeit
a = np.array([], dtype=np.int64)
for k in range(100):
    a = np.append(a, k)
396 µs ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Each call to append creates a new (larger) array and copies the existing one into the new one. In the end we made
100 expensive copy operations.
If we know the final size of our array in advance, then we should create an array of final size before filling it with
data:
%%timeit
a = np.empty(100, dtype=np.int64)
for k in range(100):
    a[k] = k
9.12 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
If data arrives in chunks and we do not know the final array's size in advance, we should use a Python list for
temporarily storing data. Appending to a Python list is cheap, because existing list data won't be copied; each list
item has its own (more or less random) location in memory. When the data is complete, we create a NumPy array
of correct size and copy the list's items to the array.
%%timeit
a = []
for k in range(0, 100):
a.append(k)
b = np.array(a, dtype=np.int64)
11.5 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
37.3 µs ± 5.83 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
13.2 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Working with large arrays we should free memory as soon as possible (use del). A non-obvious situation, where
memory can be freed, is when using small views of large arrays. Consider the following code:
a = np.ones((10000, 10000))    # large array
b = a[0]                       # view of the first row
del a
Here the large array remains in memory although we only need the first row. Because the view b is based on the array
object a, del a only removes the name a, but garbage collection cannot remove the array object. More efficient
code:
a = np.ones((10000, 10000))
b = a[0].copy()                # independent copy of the first row
del a
Here only the first (copied) row remains in memory. The original large array is removed from memory by
Python's garbage collection as soon as possible.
NumPy does not raise an exception if we use mathematical operations which lead to undefined results. Instead,
NumPy prints a warning and returns one of several special floating point numbers.
import numpy as np
16.8.1 Infinity
Results of some undefined operations can be interpreted as plus or minus infinity. NumPy represents infinity by the
special float np.inf:
a = np.array([1, -1])
b = a / 0
print(b)
print(type(b[0]))
[ inf -inf]
<class 'numpy.float64'>
Executing the division also triggers a RuntimeWarning (divide by zero). The special float np.inf can also be
used directly in computations:
a = np.inf
print(a)
inf
Also well defined operations may lead to infinity if the range of floats is exhausted:
print(np.array([1.23]) ** 10000)
[inf]
Arithmetic with np.inf follows the usual mathematical conventions where results are defined:
print(np.inf + 1)
print(5 * np.inf)
print(0 * np.inf)
print(np.inf - np.inf)
inf
inf
nan
nan
Results of undefined operations which cannot be interpreted as infinity are represented by the special float np.nan
(not a number).
a = np.array([-1, 1])
b = np.log(a)
print(b)
[nan 0.]
Computing the logarithm of a negative number also triggers a RuntimeWarning. Arithmetic involving np.nan
always yields np.nan:
print(np.nan + 1)
print(5 * np.nan)
print(0 * np.nan)
nan
nan
nan
Don’t use usual comparison operators for testing for special floats. They show strange (but well-defined) behavior:
print(np.nan == np.nan)
print(np.inf == np.inf)
False
True
Instead, call np.isnan138 or np.isposinf139 (for +∞) or np.isneginf140 (for −∞) or np.isinf141 (for
both).
Comparisons between finite numbers and np.inf are okay:
print(1000000 < np.inf)
True
138 https://numpy.org/doc/stable/reference/generated/numpy.isnan.html
139 https://numpy.org/doc/stable/reference/generated/numpy.isposinf.html
140 https://numpy.org/doc/stable/reference/generated/numpy.isneginf.html
141 https://numpy.org/doc/stable/reference/generated/numpy.isinf.html
Some NumPy functions come in two variants, which differ in their handling of np.nan. Look at amin142 and
nanmin143 , for instance.
NumPy has a submodule linalg for functions related to linear algebra, but the base numpy module also con-
tains some linear algebra functions. Linear algebra144 in NumPy’s documentation provides a comprehensive list of
functions. Here we only provide a few examples.
import numpy as np
For inner products use np.inner145 . For cross products use np.cross146 .
a = np.array([1, 2, 3])
b = np.array([1, 0, 2])
print(np.inner(a, b))
print(np.cross(a, b))
7
[ 4 1 -2]
The np.transpose147 function yields the transpose of a matrix. Alternatively, a NumPy array’s member variable
T holds the transpose, too. The transpose is a view (not a copy) of the original matrix.
A = np.array([[1, 2, 3],
[4, 5, 6]])
print(np.transpose(A))
print(A.T)
[[1 4]
[2 5]
[3 6]]
[[1 4]
[2 5]
[3 6]]
142 https://numpy.org/doc/stable/reference/generated/numpy.amin.html
143 https://numpy.org/doc/stable/reference/generated/numpy.nanmin.html
144 https://numpy.org/doc/stable/reference/routines.linalg.html
145 https://numpy.org/doc/stable/reference/generated/numpy.inner.html
146 https://numpy.org/doc/stable/reference/generated/numpy.cross.html
147 https://numpy.org/doc/stable/reference/generated/numpy.transpose.html
NumPy introduces the @ operator for matrix multiplication. It’s equivalent to calling np.matmul148 .
A = np.array([[1, 2, 3],
[4, 5, 6]])
B = np.array([[1, 0],
[2, 1],
[1, 1]])
print(A @ B)
print(np.matmul(A, B))
[[ 8 5]
[20 11]]
[[ 8 5]
[20 11]]
Determinants and inverses of square matrices can be computed with np.linalg.det149 and np.linalg.
inv150 , respectively.
A = np.array([[2, 0],
[1, 1]])
print(np.linalg.det(A))
print(np.linalg.inv(A))
2.0
[[ 0.5 0. ]
[-0.5 1. ]]
Systems of linear equations can be solved with np.linalg.solve151 :
A = np.array([[2, 0],
[1, 1]])
b = np.array([2, 3])
print(np.linalg.solve(A, b))
[1. 2.]
148 https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
149 https://numpy.org/doc/stable/reference/generated/numpy.linalg.det.html
150 https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
151 https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html
In data science contexts random numbers are important for simulating data and for selecting random subsets of data.
NumPy provides a submodule random for creating arrays of (pseudo-)random numbers.
import numpy as np
NumPy supports several different algorithms for generating random numbers. For our purposes the choice of
algorithm does not matter (for cryptographic applications it matters!). Luckily NumPy provides a default one.
We first have to create a random number generator object (or get the default one) and initialize it with a seed. The
seed determines the sequence of generated random numbers. Using a fixed seed is important if we need reproducible
results (when testing things, for instance).
rng = np.random.default_rng(0)
Random numbers may follow different distributions. NumPy provides many standard distributions, see Random
Generator152 in NumPy's documentation.
# uniformly distributed integers from 23 to 40
a = rng.integers(23, 41, (4, 10))
print(a)
[[23 35 34 24 40 27 27 26 29 26]
[29 38 31 40 31 28 37 38 39 39]
[23 32 28 27 27 38 38 27 30 37]
[25 34 31 40 37 27 38 38 27 32]]
# uniformly distributed floats from the interval [0, 1)
a = rng.random((4, 10))
print(a)
# permutation of an array
a = np.array([1, 2, 3, 4 ,5])
b = rng.permutation(a)
print(b)
[4 2 5 3 1]
152 https://numpy.org/doc/stable/reference/random/generator.html#simple-random-data
SEVENTEEN

SAVING AND LOADING NON-STANDARD DATA
There exist Python modules for almost all standard file formats. Readers and writers for several formats also are
included in larger packages like matplotlib, opencv, pandas. To share data with others always use some
standard file format (PNG or JPEG for images, CSV for tabulated data, and so on).
For storing temporary data like interim results NumPy and the pickle module from Python’s standard library
provide very convenient quick-and-dirty functions. Next to those functions, in this chapter we also discuss how to
read custom binary file formats.
Related projects:
• MNIST Character Recognition (page 311)
– The xMNIST Family of Data Sets (page 311)
– Load QMNIST (page 313)
NumPy provides functions for saving arrays to files and for loading arrays from files.
import numpy as np
The np.save153 function writes an array to a file:
a = np.array([1, 2, 3])
np.save('some_array.npy', a)
The np.load154 function reads an array from a file written with np.save:
a = np.load('some_array.npy')
print(a)
[1 2 3]
153 https://numpy.org/doc/stable/reference/generated/numpy.save.html
154 https://numpy.org/doc/stable/reference/generated/numpy.load.html
Data Science and Artificial Intelligence for Undergraduates
To save multiple arrays to one file use np.savez155 and provide each array as a keyword argument. The result is
the same as calling save for each array and packing the resulting files into an uncompressed (!) ZIP archive. File
names in the ZIP archive correspond to keyword argument names.
a = np.array([1, 2, 3])
b = np.array([4, 5])
np.savez('some_arrays.npz', a=a, b=b)
Use np.load156 to load multiple arrays written with savez. The returned object is dict-like, that is, it behaves
like a dictionary, but isn't of type dict. Conversion to dict works as expected.
data = np.load('some_arrays.npz')
a = data['a']
b = data['b']
print(a)
print(b)
[1 2 3]
[4 5]
The pickle module provides functions for pickling (saving) and unpickling (loading) almost arbitrary Python objects
to and from files, respectively. For details on what objects are picklable see documentation of the pickle module158 .
import pickle
There exist two interfaces: either use the functions dump and load or create a Pickler and an Unpickler
object. Here we only discuss the former variant. For the latter see pickle module159 in Python’s documentation.
17.2.1 Pickling
some_object = [1, 2, 3, 4]
another_object = 'I\'m a string.'

with open('some_objects.pickle', 'wb') as f:
    pickle.dump(some_object, f)
    pickle.dump(another_object, f)
17.2.2 Unpickling
Objects are read back in the order they were written:
with open('some_objects.pickle', 'rb') as f:
    some_object = pickle.load(f)
    another_object = pickle.load(f)

print(some_object)
print(another_object)
[1, 2, 3, 4]
I'm a string.
Unpickling objects from unknown sources is a security risk. See pickle’s documentation162 .
If you have many objects to pickle, create a list of all objects and pickle the list. The advantage is that for unpickling
you do not have to remember how many objects you pickled. Simply unpickle the list and look at its length.
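A small sketch of this pattern, using the in-memory counterparts dumps and loads (objects chosen for illustration):

```python
import pickle

# collect everything in one list, then pickle the list as a single object
objects = [[1, 2, 3, 4], "I'm a string.", {'answer': 42}]

data = pickle.dumps(objects)      # bytes; dump(objects, f) would write to a file
restored = pickle.loads(data)

print(len(restored))        # number of pickled objects
print(restored == objects)  # True
```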
Sometimes data comes in custom binary formats for which no library functions exist. To read data from binary files
we have to know how to interpret the data. Which bytes represent text? Which bytes represent numbers? And so on.
Without format specification binary files are almost useless.
To view binary files use a hex editor. A hex editor shows a file byte by byte, where each byte is shown as two
hexadecimal digits. If you do not have a hex editor installed, try wxHexEditor163 .
Most binary files are composed of strings, bit masks, integers, floats, and padding bytes. The hex editor shows
common interpretations of bytes at current cursor position.
161 https://docs.python.org/3/library/pickle.html#pickle.load
162 https://docs.python.org/3/library/pickle.html
163 https://www.wxhexeditor.org
Fig. 17.1: A hex editor shows file contents in hexadecimal notation and as ASCII characters (right column) together
with common interpretations (lower panel).
We already discussed decoding binary data to strings in the chapter on Text Files (page 114). The only question is
how to find the end of a string. This question should be answered in the format specification. Usually string data is
terminated by a byte with value 0.
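A small sketch of reading a 0-terminated string from a buffer of bytes (the data is invented for illustration):

```python
data = b'GeoTIFF\x00more binary data follows'

# find the terminating 0 byte and decode everything before it
end = data.index(0)
name = data[:end].decode('ascii')
print(name)    # GeoTIFF
```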
Bit masks are bytes in which each bit describes a truth value. To extract a bit from a byte all programming languages
provide bitwise operators. Here we interpret a byte as a sequence of 8 bits. The following bitwise operations can be used:
• a & b returns 1 at a bit position if and only if a and b are both 1 at this position (bitwise and).
• a | b returns 1 at a bit position if and only if at least one of a and b is 1 at this position (bitwise or)
• a ^ b returns 1 at a bit position if and only if exactly one of a and b is 1 at this position (bitwise exclusive or)
• ~a returns 1 at a bit position if and only if a is 0 at this position (bitwise not)
Python implements these bitwise operators for signed integers, which leads to somewhat unexpected results (but it's
the only way since Python has no unsigned integers). Thus, better use NumPy's unsigned types.
To read the third bit use & 0b00100000:
bit_mask = np.uint8(0b10111100)
third_bit = (bit_mask & 0b00100000) > 0
third_bit
True
To set the third bit to 1 (when writing binary files) use | 0b00100000.
bit_mask = bit_mask | 0b00100000
bin(bit_mask)
'0b10111100'
To set the third bit to 0 (when writing binary files) use & ~0b00100000.
bit_mask = bit_mask & ~np.uint8(0b00100000)
bin(bit_mask)
'0b10011100'
Integer values in a binary file may have different lengths, from 1 byte up to 8 bytes. Reading a 1-byte integer is
very simple: just read the byte. For 2-byte integers things become more involved. There is a first byte (closer to the
beginning of the file) and a second byte, and there is no universally accepted rule for converting two bytes to an
integer. Denoting the first byte by 𝑎 and the second by 𝑏 there are two possibilities:
• 𝑎 + 256 𝑏 (least significant byte first, little endian, Intel format)
• 256 𝑎 + 𝑏 (most significant byte first, big endian, Motorola format)
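For two example byte values the formulas give different integers; a quick sketch of the arithmetic:

```python
a, b = 1, 2    # first and second byte (values chosen for illustration)

print(a + 256 * b)    # 513, little endian interpretation
print(256 * a + b)    # 258, big endian interpretation
```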
If we have 4-byte integers, the problem persists. With bytes 𝑎, 𝑏, 𝑐, 𝑑 we have
• 𝑎 + 256 𝑏 + 256² 𝑐 + 256³ 𝑑 (little endian)
• 256³ 𝑎 + 256² 𝑏 + 256 𝑐 + 𝑑 (big endian)
Analogously for 8-byte integers.
NumPy provides the fromfile164 function to read integers and other numeric data from binary files. Next to
offset (starting position) and count (number of items to read) it has a dtype keyword argument. Usual Python
and NumPy types are allowed, but more detailed type control is possible by providing a string consisting of:
• '<' (little endian) or '>' (big endian) and
• 'i' (signed integer) or 'u' (unsigned integer) and
• length of item in bytes.
164 https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html
Reading unsigned 32-bit integers in little endian notation would require '<u4', for instance.
If data is already in memory, use frombuffer165 instead of fromfile. Reading the same four bytes with
different dtype strings illustrates the effect of type and byte order:
data = bytes([200, 3, 4, 5])
print(np.frombuffer(data, dtype='u1'))
print(np.frombuffer(data, dtype='i1'))
print(np.frombuffer(data, dtype='<u2'))
print(np.frombuffer(data, dtype='>u2'))
print(np.frombuffer(data, dtype='>u4'))
print(np.frombuffer(data, dtype='>i4'))
[200 3 4 5]
[-56 3 4 5]
[ 968 1284]
[51203 1029]
[3355640837]
[-939326459]
165 https://numpy.org/doc/stable/reference/generated/numpy.frombuffer.html
166 https://numpy.org/doc/1.19/user/basics.byteswapping.html
EIGHTEEN

PANDAS
The Pandas167 Python package makes NumPy’s efficient computing capabilities more accessible for data science
purposes and adds functionality for complex data types like timestamps, time periods and categories. Pandas provides
powerful data structures and functions for managing, transforming and analyzing data.
• Series (page 188)
• Data Frames (page 198)
• Advanced Indexing (page 207)
• Dates and Times (page 217)
• Categorical Data (page 223)
• Restructuring Data (page 227)
• Performance Issues (page 233)
Related exercises:
• Pandas Basics (page 273)
• Pandas Indexing (page 276)
• Advanced Pandas (page 278)
• Pandas Vectorization (page 281)
Related projects:
• Public Transport (page 317)
– Get Data and Set Up the Environment (page 317)
– Find Connections (page 322)
• Corona Deaths (page 325)
• Weather (page 305)
– Climate Change (page 309)
167 https://pandas.pydata.org/
18.1 Series
Pandas Series is one of two fundamental Pandas data types (the other is DataFrame). A Series object holds
one-dimensional data, like a list, but with more powerful indexing capabilities. Data is stored in an underlying one-
dimensional NumPy array. Thus, most operations are much more efficient than with lists.
import pandas as pd
A Series object can be created from a Python list or a dictionary, for instance. See Series constructor168 in Pandas’
documentation.
s = pd.Series([23, 45, 67, 78, 90])
s
0 23
1 45
2 67
3 78
4 90
dtype: int64
pd.Series({'a': 12, 'b': 23, 'c': 45, 'd': 67})
a 12
b 23
c 45
d 67
dtype: int64
A Series consists of an index (first column printed) and its data (second column printed). All data items have to
be of identical type. The length of a Series is provided by the size member variable (you may also use Python’s
built-in function len).
s.size
Data in a Series behaves like a one-dimensional ndarray, but Pandas’ indexing mechanisms make things different
from NumPy. Pandas implements automatic data alignment. That is, data items do not have fixed positions like in a
NumPy array. Instead, only the (possibly non-integer) index matters. Here is a first example:
168 https://pandas.pydata.org/docs/reference/api/pandas.Series.html
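Two series like the following reproduce the alignment behavior shown below (a sketch; values chosen to match the printed outputs):

```python
import pandas as pd

s1 = pd.Series([2, 4, 3, 6], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 5, 7, 9], index=['a', 'b', 'd', 'e'])
print(s1, '\n')
print(s2, '\n')
print(s1 + s2)  # items are matched by label, not by position
```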
a 2
b 4
c 3
d 6
dtype: int64
a 1
b 5
d 7
e 9
dtype: int64
a 3.0
b 9.0
c NaN
d 13.0
e NaN
dtype: float64
Both series have the labels a, b and d, so addition is defined for those items. But c and e appear in only one of the two series. For these labels addition fails and the result is NaN (not a number).
Important: Note that the data type now is float although every number is an integer. The reason is that the integer data type cannot represent the float NaN. Thus, Pandas has to change the data type of the result. We will come back to such NaN problems later on.
If we had used NumPy, then the result would be the sum of two vectors:
import numpy as np
a = np.array([2, 4, 3, 6])
b = np.array([1, 5, 7, 9])
a + b
array([ 3,  9, 10, 15])
Index and data are accessible via index and array members of Series objects:
print(s.index, '\n')
print(s.array, '\n')
print(type(s.index), '\n')
print(type(s.array))
<PandasArray>
[23, 45, 67, 78, 90]
Length: 5, dtype: int64
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.arrays.numpy_.PandasArray'>
The index member is one of several index types. Index objects will be discussed later on. The array member is
an array type defined by Pandas. If we want to have a NumPy array, we should call to_numpy():
a = s.to_numpy()
print(a, '\n')
print(type(a))
[23 45 67 78 90]
<class 'numpy.ndarray'>
18.1.4 Indexing
Accessing single items or subsets of a series works more or less the same way as for lists or dictionaries or NumPy
arrays.
The flexibility of Pandas’ multiple-items indexing mechanisms sometimes leads to confusion and unexpected errors.
In addition, some features are not well documented and a transition to more predictable and more clearly structured
indexing behavior is in progress.
Overview
There exist four widely used indexing mechanisms (here s is some series):
• s[...]: Python style indexing
• s.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)
• s.loc[...] and s.iloc[...]: new Pandas style indexing
• s.at[...] and s.iat[...]: new Pandas style indexing for more efficient access to single items
Deprecated Indexing
Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing and label based indexing. Position based means that, like for NumPy arrays, we refer to an item by its position in the series. The first item has position 0. Thus, the series’ index object is completely ignored. Providing an item of the series’ index member as index is referred to as label based indexing.
Both [...] and ix[...] behave slightly differently when using slicing. A major problem is that sometimes it
is not clear whether positional or label based indexing shall be used. Consider a series with an index made of id
numbers, that is, integers:
123 3
45 4
542 7
2 19
dtype: int64
19
123 3
45 4
dtype: int64
Without knowing the exact mechanism behind [...], which in fact calls the series’ __getitem__ method, code becomes unreadable. The same is true for ix. The ix indexer has been removed from Pandas since version 1.0.0 (January 2020). Indexing with [...] is still available, but should be avoided, at least for series with integer labels.
Preferred indexing is via loc[...] and iloc[...], the first for label based indexing, the second for positional indexing. Positional indexing is also known as integer indexing, hence the i in iloc. Slicing and boolean indexing are supported (see below).
If only a single item shall be accessed, then loc[...] and iloc[...] might be too slow due to the implementation of complex features like slicing. For single item access one should use at[...] and iat[...], providing label based and positional indexing, respectively.
Positional Indexing
Positional indexing via iloc[...] or iat[...] works like for one-dimensional NumPy arrays.
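A sketch matching the outputs below (the position lists and the boolean mask are assumptions read off the results; the original code cells are not shown here):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s, '\n')
print(s.iloc[[1, 3]], '\n')                        # items at positions 1 and 3
print(s.iloc[[0, 2]], '\n')                        # items at positions 0 and 2
print(s.iloc[[True, False, False, True, True]])    # boolean mask
```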
a 1
b 2
c 3
d 4
e 5
dtype: int64
b 2
d 4
a 1
c 3
dtype: int64
a 1
d 4
e 5
dtype: int64
An important difference to NumPy indexing is that the result is again a series. That is, the index of the selected items is returned, too.
Label Based Indexing
Label based indexing via loc[...] or at[...] works like indexing a dictionary with its keys, but slicing is allowed, too.
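A sketch matching the outputs below (the label lists and the boolean mask are read off the results):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s, '\n')
print(s.loc['b':'d'], '\n')                       # slicing includes the stop label
print(s.loc[['d', 'a', 'c']], '\n')               # list of labels, any order
print(s.loc[[True, False, False, True, True]])    # boolean mask
```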
a 1
b 2
c 3
d 4
e 5
dtype: int64
b 2
c 3
d 4
dtype: int64
d 4
a 1
c 3
dtype: int64
a 1
d 4
e 5
dtype: int64
Important: Note that slicing with labels includes the stop item!
Different items with identical labels are allowed. In that case loc[...] returns all items with the specified label and at[...] returns an array of all values with the specified label.
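The series with duplicate labels used below can be constructed like this (a sketch; values read off the output):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'b', 'c'])
print(s)
```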
print(s.loc['b'], '\n')
print(s.at['b'])
a 1
b 2
b 3
c 4
dtype: int64
b 2
b 3
dtype: int64
b 2
b 3
dtype: int64
Indexing by Callables
Both loc[...] and iloc[...] accept a function as their argument. The function has to take a series as argument
and has to return something allowed for indexing (list of indices/labels, boolean array and so on).
Scenarios justifying indexing by callables are relatively complex.
As for NumPy arrays, indexing Pandas series may return a view of the series. That is, modifying the extracted subset
of items might modify the original series. If you really need a copy of the items, use the copy169 method of Series
objects.
A full list of member functions for Series objects170 is provided in Pandas’ documentation. Here we only list a few
of them.
169 https://pandas.pydata.org/docs/reference/api/pandas.Series.copy.html
170 https://pandas.pydata.org/docs/reference/series.html
If a series is read from a file we would like to get some basic information about the series.
With describe171 we get statistical information about a series. The function returns a Series object containing
the collected information.
First and last items are returned by head172 and tail173 , respectively. Both take an optional argument specifying
the number of items to return. Default is 5.
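The outputs below presumably stem from a series like the following (values read off the head and tail outputs):

```python
import pandas as pd

# ten integer values with the default integer index
s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])
```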
print(s.describe(), '\n')
print(s.head(), '\n')
print(s.tail(3))
count 10.000000
mean 3.200000
std 2.250926
min -2.000000
25% 2.250000
50% 3.500000
75% 4.750000
max 6.000000
dtype: float64
0 2
1 4
2 6
3 5
4 4
dtype: int64
7 3
8 2
9 5
dtype: int64
Note that we did not specify labels explicitly. Thus, the Series constructor uses item positions as labels.
Iterating over the values of a series works like for Python lists:
for i in s:
print(i)
2
4
6
5
4
3
-2
171 https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html
172 https://pandas.pydata.org/docs/reference/api/pandas.Series.head.html
173 https://pandas.pydata.org/docs/reference/api/pandas.Series.tail.html
0 2
1 4
2 6
3 5
4 4
5 3
6 -2
7 3
8 2
9 5
If positional indices are required in addition to labels, use an additional enumerate:
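A sketch combining the items174 method with enumerate (assuming the series from above):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])

# i is the position, label and value come from the series
for i, (label, value) in enumerate(s.items()):
    print(i, label, value)
```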
0 0 2
1 1 4
2 2 6
3 3 5
4 4 4
5 5 3
6 6 -2
7 7 3
8 8 2
9 9 5
Vectorized Operators
Like NumPy arrays, Pandas series implement most mathematical and comparison operators.
a = pd.Series([1, 2, 3, 4])
b = pd.Series([4, 0, 6, 3])
print(a * b, '\n')
print(a < b)
0 4
1 0
2 18
3 12
dtype: int64
0 True
1 False
174 https://pandas.pydata.org/docs/reference/api/pandas.Series.items.html
Hint: Remember that Pandas uses data alignment, that is, labels matter, positions are irrelevant.
Functions all175 and any176 for boolean series are available, too.
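A boolean series matching the outputs below might look like this (a sketch; the concrete values are an assumption):

```python
import pandas as pd

s = pd.Series([True, False, True])
print(s.all())   # False, because not all items are True
print(s.any())   # True, because at least one item is True
```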
print(s.all())
print(s.any())
False
True
With drop177 we can remove items from a series. Simply pass a list of labels to the function.
t = s.drop([3, 4, 5])
print(t)
0 2
1 4
2 6
3 5
4 4
5 3
6 -2
7 3
8 2
9 5
dtype: int64
0 2
1 4
2 6
6 -2
7 3
8 2
9 5
dtype: int64
175 https://pandas.pydata.org/docs/reference/api/pandas.Series.all.html
176 https://pandas.pydata.org/docs/reference/api/pandas.Series.any.html
177 https://pandas.pydata.org/docs/reference/api/pandas.Series.drop.html
178 https://pandas.pydata.org/docs/reference/api/pandas.concat.html
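Two series can be concatenated with Pandas’ concat178 function. The series used below can be created like this (a sketch; values read off the outputs):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([0, 5, 6, 7], index=['d', 'e', 'f', 'g'])
```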
c = pd.concat([a, b])
print(a, '\n')
print(b, '\n')
print(c)
a 1
b 2
c 3
d 4
dtype: int64
d 0
e 5
f 6
g 7
dtype: int64
a 1
b 2
c 3
d 4
d 0
e 5
f 6
g 7
dtype: int64
Note that there is no check for duplicate index labels, since duplicate labels are allowed (see above).
18.2 Data Frames
A Pandas data frame (a DataFrame object) is a collection of Pandas series with a common index. Each series can be interpreted as a column in a two-dimensional table. There is a second index object for indexing columns.
Data frames are the most important data structure provided by Pandas.
import pandas as pd
A DataFrame object can be created from lists/dictionaries of lists/dictionaries or from NumPy arrays or from
lists/dictionaries of series, for instance. See DataFrame constructor183 in Pandas’ documentation. If necessary row
labels and column labels can be provided via index and columns keyword arguments, respectively.
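The data frame shown below can be created from a dictionary of dictionaries, for example (a sketch matching the displayed table):

```python
import pandas as pd

# outer keys become columns, inner keys become row labels;
# missing cells are filled with NaN
df = pd.DataFrame({'left': {'a': 1, 'b': 2},
                   'right': {'b': 3, 'c': 4}})
```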
df
left right
a 1.0 NaN
b 2.0 3.0
c NaN 4.0
df
Hint: JupyterLab shows data frames as a graphical table. In other words, Jupyter’s display function yields output different from print. With print we get a text representation of the data frame.
The shape member contains a tuple with the number of rows and columns of the data frame:
df.shape
(3, 3)
183 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Like series, data frames implement data alignment where possible. This applies to row indexing as well as to column indexing.
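The two data frames added below can be set up like this (a sketch; values and labels read off the displayed tables):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                   index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.arange(11, 20).reshape(3, 3),
                   index=['b', 'c', 'd'], columns=['A', 'C', 'D'])
```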
display(df1)
display(df2)
display(df1 + df2)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A C D
b 11 12 13
c 14 15 16
d 17 18 19
A B C D
a NaN NaN NaN NaN
b 15.0 NaN 18.0 NaN
c 21.0 NaN 24.0 NaN
d NaN NaN NaN NaN
The index object for row indexing is accessible via index member and the index object for column indexing is
accessible via columns member.
Data frames also have a to_numpy184 method returning a data frame’s data as NumPy array.
18.2.4 Indexing
Indexing data frames is very similar to indexing series. The major difference is that for data frames we have to specify
a row and a column instead of only a row.
Overview
There exist four widely used mechanisms (here df is some data frame):
• df[...]: Python style indexing
• df.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)
• df.loc[...] and df.iloc[...]: new Pandas style indexing
• df.at[...] and df.iat[...]: new Pandas style indexing for more efficient access to single items
184 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing, label based indexing and mixed indexing (position for row, label for column, and vice versa). Both [...] and ix[...] behave slightly differently. As already discussed for series, sometimes it is not clear whether positional or label based indexing shall be used. Thus, [...] should be used with care. The ix indexer has been removed from Pandas in January 2020, but may appear in old code and documents.
Indexing with [...] is mainly used for selecting columns by label. For other purposes new Pandas style indexing
should be used.
s = df['A']
df_part = df[['B', 'C']]
display(s)
display(df_part)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
a 1
b 4
c 7
Name: A, dtype: int64
B C
a 2 3
b 5 6
c 8 9
If only one label is provided, then the result is a Series object. If a list of labels is provided, then the result is a
DataFrame object. In case of integer labels, label based indexing is used when selecting columns:
2 23 45
a 1 2 3
b 4 5 6
c 7 8 9
a 1
b 4
c 7
Name: 2, dtype: int64
Hint: Label based column indexing can be used to create new columns: df['my_new_column'] = 0 creates a new column named my_new_column filled with zeros.
Positional Indexing
Positional indexing via iloc[...] or iat[...] works like for two-dimensional NumPy arrays.
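A sketch matching the outputs below (the position lists and slices are assumptions read off the results):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
print(df, '\n')
print(df.iloc[1:, 0:2], '\n')          # row and column slices
print(df.iloc[[1, 0], ::-1], '\n')     # position list and reversed column slice
print(df.iloc[[0, 2], [1, 2]])         # row and column position lists
```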
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
b 4 5
c 7 8
C B A
b 6 5 4
a 3 2 1
B C
a 2 3
c 8 9
Variants (slicing, boolean and so on) may be mixed for rows and columns.
Label Based Indexing
Label based indexing via loc[...] or at[...] works like indexing a dictionary with its keys, but slicing is allowed, too.
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
c 7 8
b 4 5
B A
c 8 7
a 2 1
B C
a 2 3
c 8 9
Important: Like for series label based slicing includes the stop label!
Indexing by Callables
Both loc[...] and iloc[...] accept a function for row index and column index. The function has to take a
data frame as argument and has to return something allowed for indexing (list of indices/labels, boolean array and so
on).
Indexing with [...] allows for mixing positional and label based indexing. But [...] should be avoided as
discussed for series. The only exception is column selection. With Pandas’ new indexing style mixed indexing is still
possible but requires some more bytes of code:
display(df.loc['a':'b', df.columns[0:2]])
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
a 1 2
b 4 5
The idea is to use loc[...] and the columns member variable. With columns we get access to the index object for column indexing. This index object allows for usual positional indexing techniques (slicing, boolean and so on). The result of indexing an index object is again an index object, but it only contains the desired subset of indices. This smaller index object then is passed to loc[...]. The same is possible for rows via the index member.
Like for NumPy arrays indexing Pandas data frames may return a view of the data frame. That is, modifying the
extracted subset of items might modify the original data frame. If you really need a copy of the items, use the copy185
method of DataFrame objects.
A full list of member functions for DataFrame objects186 is provided in Pandas’ documentation. Here we only list
a few.
With describe187 we get basic statistical information about each column holding numerical values. The function returns a DataFrame object containing the collected information. Only columns with numerical data are considered by describe.
First and last rows are returned by head188 and tail189 . Both take an optional argument specifying the number of
rows to return. Default is 5.
The info190 method prints memory usage and other useful information.
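The data frame used below can be defined as follows (a sketch; values read off the outputs):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9],
                   'D': ['some', 'string', 'here']},
                  index=['a', 'b', 'c'])
```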
display(df.describe())
display(df.head(2))
display(df.tail(2))
df.info()
A B C D
a 1 2 3 some
b 4 5 6 string
c 7 8 9 here
A B C
count 3.0 3.0 3.0
mean 4.0 5.0 6.0
std 3.0 3.0 3.0
min 1.0 2.0 3.0
25% 2.5 3.5 4.5
50% 4.0 5.0 6.0
75% 5.5 6.5 7.5
max 7.0 8.0 9.0
185 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
186 https://pandas.pydata.org/docs/reference/dataframe.html
187 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
188 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
189 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html
190 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
A B C D
a 1 2 3 some
b 4 5 6 string
A B C D
b 4 5 6 string
c 7 8 9 here
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null int64
2 C 3 non-null int64
3 D 3 non-null object
dtypes: int64(3), object(1)
memory usage: 120.0+ bytes
To iterate over columns use items191 , which returns tuples containing the column label and the column’s data as
Series object.
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A
a 1
b 4
c 7
Name: A, dtype: int64
B
a 2
b 5
c 8
Name: B, dtype: int64
C
a 3
b 6
c 9
Name: C, dtype: int64
191 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html
Iteration over rows can be implemented via the iterrows192 method. Analogously to items it returns tuples containing the row label and a Series object with the data.
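A sketch matching the outputs below (assuming the data frame with columns A, B, C from above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
                  index=['a', 'b', 'c'])

for label, row in df.iterrows():
    print(label)
    print(row, '\n')
```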
a
A 1
B 2
C 3
Name: a, dtype: int64
b
A 4
B 5
C 6
Name: b, dtype: int64
c
A 7
B 8
C 9
Name: c, dtype: int64
Hint: Data in each column of a data frame has identical type, but types in a row may differ (from column to column).
Thus, calling iterrows may involve type casting to get row data as Series object.
Important: Usually there is no need to iterate over rows, because Pandas provides much faster vectorized code for
almost all operations needed for typical data science projects.
192 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html
Vectorized Operators
The Pandas function read_csv199 reads a CSV file and returns a data frame.
The DataFrame method to_csv200 writes data to a CSV file.
Missing Values
Missing values are a common problem in data science. Pandas provides several functions and mechanisms for handling
missing values. Important functions:
• isna201 (return boolean data frame with True at missing value positions),
• notna202 (return boolean data frame with False at missing value positions),
• fillna203 (fill all missing values, different fill methods are provided),
• dropna204 (remove rows or columns containing missing values or consisting completely of missing values).
Details may be found in Pandas user guide205 .
193 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.concat.html
194 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
195 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html
196 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html
197 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
198 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
199 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
200 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
201 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html
202 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html
203 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
204 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
205 https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
18.3 Advanced Indexing
One of Pandas’ most useful features is its powerful indexing mechanism. Here we’ll discuss several types of index objects.
import pandas as pd
Series and data frames have one or two index objects, respectively. They are accessible via Series.index or
DataFrame.index and DataFrame.columns. An index object is a list-like object holding all row or column
labels.
To create a new index, call the constructor of the Index class:
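A minimal sketch (the original example is not shown in this text version):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])
print(index)
```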
18.3.1 Reindexing
An index object may be replaced by another one. We have to take care whether data alignment between old and new
index shall be applied or not.
Series and DataFrame objects provide the reindex206 method. This method takes an index object (or a list of labels) and replaces the existing index by the new one. Data alignment is applied; that is, rows/columns with a label in the intersection of old and new index remain unchanged, but rows/columns with an old label not in the new index are dropped. If there are labels in the new index which aren’t in the old one, then rows/columns are filled with NaN, with some specified value, or by some more complex filling logic.
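A sketch reproducing the three output blocks below (fill values and the fill method are assumptions read off the results):

```python
import pandas as pd

s = pd.Series([123, 456, 789], index=['a', 'b', 'e'])
print(s, '\n')
print(s.reindex(['a', 'b', 'c', 'd', 'e']), '\n')                # fill with NaN
print(s.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0), '\n')  # fill with 0
print(s.reindex(['a', 'b', 'c', 'd', 'e'], method='bfill'))      # backward fill
```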
a 123
b 456
e 789
dtype: int64
a 123.0
b 456.0
c NaN
d NaN
e 789.0
dtype: float64
206 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 0
d 0
e 789
dtype: int64
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 789
d 789
e 789
dtype: int64
The align207 method reindexes two series/data frames such that both have the same index.
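The series aligned below can be set up like this (a sketch; values read off the outputs):

```python
import pandas as pd

s1 = pd.Series([123, 456, 789], index=['a', 'b', 'e'])
s2 = pd.Series([98, 76, 54], index=['a', 'c', 'e'])

# align returns reindexed copies sharing a common index
s1_aligned, s2_aligned = s1.align(s2)
```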
a 123
b 456
e 789
dtype: int64
207 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.align.html
a 98
c 76
e 54
dtype: int64
a 123.0
b 456.0
c NaN
e 789.0
dtype: float64
a 98.0
b NaN
c 76.0
e 54.0
dtype: float64
To simply replace an index without data alignment, that is to rename all the labels, there are two variants:
• replace the index object by a new one of same length via usual assignment,
• use an existing column as index.
a 123
b 456
e 789
dtype: int64
aa 123
bb 456
cc 789
dtype: int64
To use a column of a data frame as index call the set_index208 method and provide the column label.
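The data frame used below may be defined like this (a sketch matching the outputs):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
                  index=['a', 'b', 'c'])
```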
df = df.set_index('A')
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
208 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
B C
A
1 2 3
4 5 6
7 8 9
To convert the index to a usual column call reset_index209 . The index will be replaced by the standard index
(integers starting at 0).
df = df.reset_index()
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9
Index objects may be shared between several series or data frames. Simply pass the index of an existing series or data
frame to the constructor of a new series or data frame or assign it directly or use reindex.
s1.index is s2.index
True
Note: Reindexing two series/data frames with align (see above) results in a shared index.
To append data to a series or data frame we may use label based indexing:
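The series modified below can be created like this (a sketch; values read off the outputs):

```python
import pandas as pd

s = pd.Series([123, 456], index=['a', 'b'])
print(s)
```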
s.loc['e'] = 789
s
a 123
b 456
dtype: int64
a 123
b 456
e 789
dtype: int64
209 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
df.loc['c', 'D'] = 5
df
A B
a 1 2
b 3 4
A B D
a 1.0 2.0 NaN
b 3.0 4.0 NaN
c NaN NaN 5.0
Pandas’ standard index is of type RangeIndex210 . It’s used whenever a series or a data frame is created without
specifying an index.
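The index iterated over below can be created like this (start, stop, and step are read off the printed values 5, 7, …, 19):

```python
import pandas as pd

index = pd.RangeIndex(5, 20, 2)   # start, stop (exclusive), step
```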
for k in index:
print(k)
5
7
9
11
13
15
17
19
The IntervalIndex211 class allows for imprecise indexing. Each item in a series or data frame can be accessed
by any number in a specified interval.
210 https://pandas.pydata.org/docs/reference/api/pandas.RangeIndex.html
211 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.html
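The series below can be set up as follows (a sketch; the interval bounds are read off the output — note that the intervals [6, 7) and [6.5, 9) overlap on purpose):

```python
import pandas as pd

index = pd.IntervalIndex.from_arrays([2, 6, 6.5], [3, 7, 9], closed='left')
s = pd.Series([23, 45, 67], index=index)
print(s)
```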
[2.0, 3.0) 23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
print(s.loc[2], '\n')
# print(s.loc[3], '\n') # KeyError
print(s.loc[2.5], '\n')
print(s.loc[6.7])
23
23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Indexing by intervals:
23
IntervalIndex objects provide overlaps212 and contains213 methods for more flexible indexing:
[2.0, 3.0) 23
[6.0, 7.0) 45
dtype: int64
mask = s.index.contains(6.7)
print(mask)
s.loc[mask]
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Note: In principle contains should work with intervals instead of concrete numbers, too. But in Pandas 1.5.1
NotImplementedError is raised.
212 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.overlaps.html
213 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.contains.html
Up to some details, multi-level indexing is indexing using tuples as labels. Corresponding index objects are of type MultiIndex214. The major application of multi-level indices is representing several dimensions. Thus, high dimensional data can be stored in a two-dimensional data frame.
Let’s start with a two-level index. The first level contains courses of study provided by a university. The second level contains some lecture series. The data is the number of students from each course attending a lecture and the average rating for each lecture.
Using the MultiIndex constructor is the most general, but not a very straightforward way to create a multi-level index. We have to provide lists of labels for each level. In addition, we need lists of codes for each level indicating which label to use at each position. Each level may have a name.
courses_codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]
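The codes above can be combined with the level labels like this (a sketch repeating the code lines; level labels and data values are read off the displayed table):

```python
import pandas as pd

courses_codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]

index = pd.MultiIndex(levels=[['Mathematics', 'Physics', 'Philosophie'],
                              ['Computer Science', 'Mathematics', 'Epistemology']],
                      codes=[courses_codes, lectures_codes],
                      names=['course', 'lecture'])

df = pd.DataFrame({'students': [10, 15, 8, 20, 17, 3, 2, 1, 89],
                   'rating': [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1]},
                  index=index)
```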
df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1
Hint: The mentioned creation methods are static methods. See Types (page 80) for some explanation of the
concept.
The above index contains each combination of items from two lists. Thus from_product is applicable:
214 https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html
215 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_arrays.html
216 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_tuples.html
217 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html
218 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1
df.index.levels
Note that multi-level indexing is not restricted to row indexing. Multi-level column indexing works in exactly the same manner.
Accessing Data
Accessing data works as for other types of indices. Labels now are tuples containing one item per level. But there
exist additional techniques specific to multi-level indices.
Single Tuples
students 20.0
rating 3.0
Name: (Physics, Computer Science), dtype: float64
df.iloc[1, 1]
1.3
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics Epistemology 8 3.6
A new feature specific to multi-level indexing is slicing inside tuples. We would expect notation like
('Physics', :)
to get all rows with Physics at first level. But usual slicing syntax is not available here. Instead we have to use the
built-in slice function. It takes start, stop and step values (start and step default to None) and returns a slice
object. More precisely, the slice function is the constructor for slice objects. A slice object simply holds three
values (start, stop, step).
df.loc[('Physics', slice(None)), :]
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
df = df.sort_index()
df.loc[(slice('Mathematics', 'Physics'), 'Epistemology'), :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7
Note that label based slicing above requires a sorted index. Thus, we have to call sort_index219 first.
An alternative to slice is creating a pd.IndexSlice object, which allows for natural slicing syntax:
df.loc[pd.IndexSlice['Mathematics':'Physics', 'Epistemology'], :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7
219 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html
The get_level_values220 method of an index object takes a level name or level index as argument and returns
a simple Index object containing only the labels at the specified level. This object can then be used to create a
boolean array for row indexing.
import numpy as np
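A sketch of the mask construction (the data frame is reconstructed here as in the examples above; the two conditions are assumptions read off the result below):

```python
import numpy as np
import pandas as pd

courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']
index = pd.MultiIndex.from_product([courses, lectures], names=['course', 'lecture'])
df = pd.DataFrame({'students': [10, 15, 8, 20, 17, 3, 2, 1, 89],
                   'rating': [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1]},
                  index=index)

# keep rows whose course is not Physics and whose lecture is not Epistemology
mask = np.logical_and(df.index.get_level_values('course') != 'Physics',
                      df.index.get_level_values('lecture') != 'Epistemology')
```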
df.loc[mask, :]
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Comparing an index object with a single value results in a one-dimensional boolean NumPy array with the same length as the index object. NumPy’s logical_and function implements elementwise logical AND.
Cross-Sections
There’s a shorthand for selecting all rows with a given label at a given level: xs221 . This method takes a label and a
level and returns corresponding rows as data frame.
df.xs('Mathematics', level='lecture')
students rating
course
Mathematics 15 1.3
Philosophie 1 4.9
Physics 17 1.6
Many Pandas functions accept a level keyword argument (like xs above or drop222 ) to provide functionality
adapted to multi-level indices.
220 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_level_values.html
221 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
222 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
There are many different styles for using multi-level indexing. Some of them are very confusing for beginners, because the same syntax may have different semantics depending on the objects (row/column label, tuple, list) passed as arguments. Here we only considered safe and syntactically clear variants. To get an idea of other indexing styles have a look at MultiIndex / advanced indexing223 in the Pandas user guide.
18.4 Dates and Times
We already met the datetime module in Web Access (page 122) for handling points in time and time durations. Pandas extends those capabilities by introducing time periods (durations associated with a point in time) and more advanced calendar arithmetic.
Pandas also provides date and time related index objects to easily index time series data: DatetimeIndex,
TimedeltaIndex, PeriodIndex.
import pandas as pd
The basic data structure for representing points in time is the Timestamp224 object. It provides lots of useful methods for conversion from and to other date and time formats.
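The output below can be produced like this:

```python
import pandas as pd

pd.Timestamp('2020-02-15 12:34')
```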
Timestamp('2020-02-15 12:34:00')
The basic data structure for representing durations is the Timedelta225 object. Timedelta objects can be used on their own or to shift time stamps.
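A small sketch of shifting a time stamp by a duration (the concrete values are assumptions):

```python
import pandas as pd

ts = pd.Timestamp('2020-02-15 12:34')
delta = pd.Timedelta('2 days 3 hours')

print(ts + delta)   # shift forward in time
print(ts - delta)   # shift backward in time
```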
18.4.3 Periods
A period in Pandas is a time interval paired with a time stamp. Interpretation is as follows:
• The interval is one of several preset intervals, like a calendar month or a week from Monday till Sunday. See
Offset aliases226 and Anchored aliases227 for available intervals.
• The time stamp selects a concrete interval, the month or the week containing the time stamp, for instance.
The basic data structure for representing periods are Period228 objects.
223 https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
224 https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html
225 https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html
226 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
227 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets
228 https://pandas.pydata.org/docs/reference/api/pandas.Period.html
2010-01-01 00:00:00
2010-12-31 23:59:59.999999999
2009-11-01 00:00:00
2010-10-31 23:59:59.999999999
Using time stamps for indexing offers lots of nice features, because Pandas originally was developed for handling time series.
The constructor for DatetimeIndex229 objects takes a list of time stamps and an optional frequency. Frequency
has to match the passed time stamps. If there is no common frequency in the data, the frequency is None (default).
A more convenient method is pd.date_range230 :
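For example, a daily index of ten days (as used in the indexing examples below) can be created like this (a sketch):

```python
import pandas as pd

index = pd.date_range('2018-03-14', periods=10, freq='D')
```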
df = pd.DataFrame({'year': [2018, 2018, 2020, 2020],
                   'month': [4, 6, 9, 9],
                   'day': [1, 2, 3, 4]})
display(df)
pd.to_datetime(df)
229 https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
230 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
0 2018-04-01
1 2018-06-02
2 2020-09-03
3 2020-09-04
dtype: datetime64[ns]
The to_datetime231 function expects the columns to be named 'day', 'month', and 'year'. It returns a series
of type datetime64, which may be converted to a DatetimeIndex.
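The data frame df itself is not shown above; judging from the output it may have looked like this:

```python
import pandas as pd

# columns 'year', 'month', 'day' as expected by pd.to_datetime
df = pd.DataFrame({'year': [2018, 2018, 2020, 2020],
                   'month': [4, 6, 9, 9],
                   'day': [1, 2, 3, 4]})

s = pd.to_datetime(df)        # series of dtype datetime64[ns]
idx = pd.DatetimeIndex(s)     # usable as index for other series or data frames
```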
Indexing
Exact Indexing
Using Timestamp objects for label-based indexing yields the items with the corresponding time stamp, if there are
any. Slicing works as usual.
print(s.loc[pd.Timestamp('2018-3-16')], '\n')
#print(s.loc[pd.Timestamp('2018-3-16 10:00')], '\n') # KeyError
print(s.loc[pd.Timestamp('2018-3-16 00:00')], '\n')
print(s.loc[pd.Timestamp('2018-3-16'):pd.Timestamp('2018-3-20')])
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
Freq: D, dtype: int64
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
Freq: D, dtype: int64
231 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
Inexact Indexing
Passing strings containing partial dates/times selects time ranges. This technique is referred to as partial string indexing. Slicing is allowed.
print(s.loc['2018-3'], '\n')
print(s.loc['2018-3':'2018-4'])
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
...
2018-06-17 96
2018-06-18 97
2018-06-19 98
2018-06-20 99
2018-06-21 100
Freq: D, Length: 100, dtype: int64
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
2018-03-24 11
2018-03-25 12
2018-03-26 13
2018-03-27 14
2018-03-28 15
2018-03-29 16
2018-03-30 17
2018-03-31 18
Freq: D, dtype: int64
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
2018-03-24 11
2018-03-25 12
2018-03-26 13
2018-03-27 14
2018-03-28 15
2018-03-29 16
Inexact indexing has some pitfalls, which are described in Partial string indexing232 and Slice vs. exact match233 of
the Pandas user guide.
Pandas provides lots of functions for working with time stamp indices. Some are:
• asfreq234 (upsampling with fill values or filling logic)
• shift235 (shift index or data by some time period)
• resample236 (downsampling with aggregation, see below)
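Sketches of the first two methods (the data is my own choice):

```python
import pandas as pd

s = pd.Series(range(3), index=pd.date_range('2018-03-14', periods=3, freq='D'))

s2 = s.asfreq('12h')          # upsampling; new time stamps get NaN values
s3 = s.shift(1, freq='D')     # shift the index by one day, keep the data
```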
The resample method returns a Resampler237 object, which provides several methods for calculating data values
at the new time stamps. Examples are sum, mean, min, max. All these methods return a series or a data frame.
s2 = s.resample('5D').sum()
s2
232 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing
233 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#slice-vs-exact-match
234 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.asfreq.html
235 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html
236 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html
237 https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
2018-03-14 15
2018-03-19 40
2018-03-24 65
2018-03-29 90
2018-04-03 115
2018-04-08 140
2018-04-13 165
2018-04-18 190
2018-04-23 215
2018-04-28 240
2018-05-03 265
2018-05-08 290
2018-05-13 315
2018-05-18 340
2018-05-23 365
2018-05-28 390
2018-06-02 415
2018-06-07 440
2018-06-12 465
2018-06-17 490
Freq: 5D, dtype: int64
Period indices work analogously to time stamp indices. The corresponding class is PeriodIndex238 .
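The series s used in the following examples is not shown in the text; judging from the outputs, it was presumably constructed along these lines:

```python
import pandas as pd

# daily periods starting at 2018-03-14, values 1 to 10
s = pd.Series(range(1, 11),
              index=pd.period_range('2018-03-14', periods=10, freq='D'))
```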
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
Freq: D, dtype: int64
Indexing with time stamps selects the appropriate period, like with IntervalIndex objects:
s.loc[pd.Timestamp('2018-03-15 12:34')]
238 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html
With a time stamp index the above line would lead to a KeyError. But for periods it's interpreted as: select the
period containing the time stamp.
Slicing behaves similarly:
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
Freq: D, dtype: int64
Next to numerical and string data one frequently encounters categorical data. That is data of arbitrary type with a
finite range of admissible values. The admissible values are called categories. There are two kinds of categorical data:
• nominal data (finitely many different values without any order)
• ordinal data (finitely many different values with a linear order)
Examples:
• colors red, blue, green, yellow (nominal)
• business days Monday, Tuesday, Wednesday, Thursday, Friday (ordinal)
Pandas provides explicit support for categorical data and indices. Major advantages of categorical data compared to
string data are lower memory consumption and more meaningful source code.
import pandas as pd
Pandas has a class Categorical to hold a list of categorical data with (ordinal) or without (nominal) ordering.
Such Categorical objects can directly be converted to series or columns of a data frame. Almost always category
labels are strings, but any other data type is allowed, too.
cat_data = pd.Categorical(['red', 'green', 'blue', 'green', 'green'],
                          categories=['red', 'green', 'blue'])  # reconstructed from the output below
s = pd.Series(cat_data)
s
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['red', 'green', 'blue']
Passing dtype='category' to series or data frame constructors works, too. Categories then are determined
automatically.
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['blue', 'green', 'red']
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['blue', 'green', 'red']
quality = pd.Series(pd.Categorical(['good', 'poor', 'excellent'],  # example values (reconstruction)
                                   categories=['poor', 'good', 'very good', 'excellent'],
                                   ordered=True))
print(quality.min())
print(quality.max())
poor
excellent
Instead of using the general categorical data type we may define new categorical types. Strictly speaking, categorical
isn't a well-defined type, because we have to provide the category labels to obtain a full-fledged data type. A
more natural way of using categories is to define a data type for each set of categories via CategoricalDtype239 .
A further advantage is that the same set of categories can be used for several series and data frames simultaneously.
239 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html
0 red
1 red
2 NaN
3 blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']
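Presumably the output above resulted from a construction along the following lines (the concrete values are assumptions):

```python
import pandas as pd

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'])

# values outside the categories become NaN ('black' here)
s = pd.Series(['red', 'red', 'black', 'blue'], dtype=colors)
```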
Most machine learning algorithms expect numerical input. Thus, categorical data has to be converted to numerical
data first.
For ordinal data one might use the numbers 1, 2, 3,… instead of the original category labels. But for nominal data the
natural ordering of integers adds artificial structure to the data, which might affect an algorithm's behavior. Thus, one-hot
encoding is usually used for converting nominal data to numerical data.
The idea is to replace a variable holding one of 𝑛 categories by 𝑛 boolean variables. Each new variable corresponds to
one category. Exactly one variable is set to True. Pandas supports this conversion via the get_dummies240 function.
df = pd.get_dummies(s)
df
0 red
1 red
2 green
3 blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']
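The output of get_dummies itself is missing above; a self-contained sketch (values taken from the displayed series):

```python
import pandas as pd

s = pd.Series(pd.Categorical(['red', 'red', 'green', 'blue'],
                             categories=['red', 'green', 'blue', 'yellow']))

# one indicator column per category; exactly one entry per row is set
df = pd.get_dummies(s)
```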
Series or data frame columns with categorical data have a cat member providing access to the set of categories.
Some member functions are:
• rename_categories241 (modify category labels),
• add_categories242 (add category; at the highest position, if ordinal),
• remove_categories243 (remove category, replacing corresponding items by nan),
• union_categoricals244 (join sets of categories; a function in pandas.api.types rather than a cat method).
240 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
241 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.rename_categories.html
242 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.add_categories.html
243 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.remove_categories.html
244 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html
Information about categories cannot be stored in CSV files. Instead, category labels are written to the CSV file in
their native data type. When reading CSV data into a data frame, columns have to be converted to categorical types
again, if desired.
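A sketch of the round trip (column name and values are my own choice):

```python
import io
import pandas as pd

df = pd.DataFrame({'color': pd.Categorical(['red', 'green', 'red'])})

csv = df.to_csv(index=False)           # labels are written as plain strings

df2 = pd.read_csv(io.StringIO(csv))    # read back as ordinary string column
df2['color'] = df2['color'].astype('category')
```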
Pandas supports categorical indices via CategoricalIndex objects. Simply pass a Categorical object as
index when creating a series or a data frame.
quality = pd.Categorical(['poor', 'good', 'excellent', 'good', 'very good', 'poor'],
                         categories=['poor', 'good', 'very good', 'excellent'],
                         ordered=True)
s = pd.Series([3, 4, 2, 23, 41, 5], index=quality)
print(s, '\n')
s = s.sort_index()
s
poor 3
good 4
excellent 2
good 23
very good 41
poor 5
dtype: int64
poor 3
poor 5
good 4
good 23
very good 41
excellent 2
dtype: int64
print(s.loc['poor'], '\n')
print(s.loc['poor':'very good'])
poor 3
poor 5
dtype: int64
poor 3
poor 5
good 4
good 23
very good 41
dtype: int64
Continuous data, or discrete data with a too large range, can be converted to categories by providing a list of intervals
(bins) into which the items shall be placed. Each bin can be regarded as a category. Binning is important for machine
learning tasks which require discrete data. The pd.cut245 function implements binning.
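A minimal sketch of binning with pd.cut (bin edges and labels are my own choice):

```python
import pandas as pd

ages = pd.Series([2, 13, 25, 67, 41])

# three bins: (0, 18], (18, 65], (65, 120]
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=['child', 'adult', 'senior'])
```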
Restructuring and aggregation are two basic methods for extracting statistical information from data. We start with
groupwise aggregation and then discuss several forms of restructuring without and with additional aggregation.
import pandas as pd
18.6.1 Grouping
Grouping is the first step in the so-called split-apply-combine procedure in data processing. Data is split into groups
by some criterion, then some function is applied to each group, and finally the results get (re-)combined. Typical functions
in the apply step are sum or mean (more generally: aggregation) or any kind of transform or filter function (drop
groups containing NaN items, for instance).
This chapter follows the structure of the Pandas user guide246 , but leaves out sections on very specific details. Feel
free to have a look at those details later on.
Grouping is done by calling the groupby247 method of a series or data frame. It takes a column label or a list of
column labels as argument and returns a SeriesGroupBy or DataFrameGroupBy object. The returned object
represents a kind of list of groups, each group being a small series or data frame. All rows in a group have identical
values in the columns used for grouping.
The ...GroupBy object offers several methods for working with the determined groups. Iterating over such objects
is possible, too.
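The data frame df used throughout the following examples is not shown in the text; from the outputs below it can be reconstructed as follows (the answer column only appears in the later examples):

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no']})
```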
Grouping by one column and subsequent aggregation yields an index with values from the column used for grouping:
g = df.groupby('age')
df_means = g.mean()
df_means
age score
0 2 2.3
1 3 4.5
2 3 3.4
3 2 2.0
4 4 5.4
5 5 7.2
6 5 2.8
7 5 3.9
245 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
246 https://pandas.pydata.org/docs/user_guide/groupby.html
247 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
age: 2
age score
0 2 2.3
3 2 2.0
age: 3
age score
1 3 4.5
2 3 3.4
age: 4
age score
4 4 5.4
age: 5
age score
5 5 7.2
6 5 2.8
7 5 3.9
score
age
2 2.150000
3 3.950000
4 5.400000
5 4.633333
g = df.groupby(['age', 'answer'])
df_means = g.mean()
display(df_means)
age: 2
answer: no
age: 2
answer: yes
age: 3
answer: no
age: 4
answer: no
age: 5
answer: no
age: 5
answer: yes
score
age answer
2 no 2.00
yes 2.30
3 no 3.95
4 no 5.40
5 no 3.90
yes 5.00
Grouping by levels of a multi-level index is possible by providing the level argument to groupby.
With get_group we have access to single groups:
g.get_group((5, 'yes'))
g = df.groupby('age')
g['answer'].get_group(5)
5 yes
6 yes
7 no
Name: answer, dtype: object
Aggregation
To apply a function to each column of each group use aggregate248 . It takes a function or a list of functions as
argument. Providing a dictionary of column: function pairs allows for column specific functions.
import numpy as np
display(g.aggregate(np.min))
display(g.aggregate([np.min, np.max]))
display(g.aggregate({'answer': np.min, 'score': np.mean}))
answer score
age
2 no 2.0
3 no 3.4
4 no 5.4
5 no 2.8
answer score
amin amax amin amax
age
2 no yes 2.0 2.3
3 no no 3.4 4.5
4 no no 5.4 5.4
5 no yes 2.8 7.2
answer score
age
2 no 2.150000
3 no 3.950000
4 no 5.400000
5 no 4.633333
g.size()
age
2 2
3 2
4 1
5 3
dtype: int64
Many aggregation functions are directly accessible from the ...GroupBy object. Examples are ...GroupBy.
sum and ...GroupBy.mean. See Computations / descriptive stats249 for a complete list.
249 https://pandas.pydata.org/docs/reference/groupby.html#computations-descriptive-stats
Transformation
The transform250 method allows transforming rows groupwise, resulting in a data frame with the same shape as the
original one.
g = df.groupby('age')
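The transform call itself is missing from the text. A sketch with a centering transformation (function and data are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2],
                   'score': [2.3, 4.5, 3.4, 2.0]})

g = df.groupby('age')

# subtract each group's mean score; the result has the same shape as df[['score']]
centered = g[['score']].transform(lambda x: x - x.mean())
```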
Filtering
To remove groups use the filter251 method. It takes a function as argument and returns a data frame in which the
rows of all removed groups are dropped. The passed function gets the group (series or data frame) and has to return
True (keep group) or False (remove group).
g = df.groupby('age')
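The filter call is missing from the text. A sketch (criterion and data are my own choice), keeping only groups with at least two members:

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2, 4],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4]})

g = df.groupby('age')

# a group survives only if it contains at least two rows
kept = g.filter(lambda group: len(group) >= 2)
```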
There are three basic techniques for restructuring data in a data frame:
• pivot252 (interprets two specified columns as row and column index)
• stack253 /unstack254 (move (level of) column index to (level of) row index and vice versa)
• melt255 (create new column from some column labels)
Details and graphical illustrations of these techniques may be found in Pandas' user guide256 (first three sections).
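A minimal sketch of pivot and melt (column names and values are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                   'year': [2020, 2021, 2020, 2021],
                   'population': [10, 11, 20, 22]})

# long to wide: rows indexed by city, columns by year
wide = df.pivot(index='city', columns='year', values='population')

# wide back to long: column labels become values of a new column
long = wide.reset_index().melt(id_vars='city', value_name='population')
```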
Pandas supports pivot tables via pivot_table257 function. Pivot tables are almost the same as pivoting with
pivot258 but allow for multiple values per data cell, which then are aggregated to one value.
Details may be found in Pandas’ user guide259 .
Similar functionality is provided by crosstab260 . See Pandas user guide261 , too.
Similar to the discussion in Efficiency Considerations (page 173) for NumPy, with Pandas we have to take care of how
we implement certain operations, at least if performance matters. NumPy guidelines carry over to Pandas, but some
additional remarks are in order.
import pandas as pd
252 https://pandas.pydata.org/docs/reference/api/pandas.pivot.html
253 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack
254 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.unstack
255 https://pandas.pydata.org/docs/reference/api/pandas.melt.html
256 https://pandas.pydata.org/docs/user_guide/reshaping.html
257 https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
258 https://pandas.pydata.org/docs/reference/api/pandas.pivot.html
259 https://pandas.pydata.org/docs/user_guide/reshaping.html#pivot-tables
260 https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
261 https://pandas.pydata.org/docs/user_guide/reshaping.html#cross-tabulations
18.7.1 Vectorization
Analogously to NumPy, in Pandas we should avoid iterating over rows of series or data frames. Almost always
vectorization is possible. For numeric columns Pandas relies on NumPy’s vectorized function calls. For string and
date/time data Pandas implements tailor-made vectorization techniques.
Indices, series and data frame columns containing string data have a member str providing typical string operations.
Calling such a method applies the operation to each data item.
s = pd.Series(['abc', 'def', 'ghijklmn'])  # series reconstructed from the output below
s.str.upper()
0 ABC
1 DEF
2 GHIJKLMN
dtype: object
Indices, series and data frame columns containing timestamp data have a member dt providing typical date/time
operations. Calling such a method applies the operation to each data item.
# example dates: a Saturday, a Sunday, a Monday (reconstruction)
s = pd.Series(pd.to_datetime(['2021-05-01', '2021-05-02', '2021-05-03']))
s.dt.dayofweek
0 5
1 6
2 0
dtype: int64
Pandas has a function eval264 which executes Python-like code provided as a string. Due to optimization techniques
(CPU caching and others) eval is faster than standard Python code for long expressions involving large data frames.
The DataFrame.query265 method provides a simplified interface to eval for selecting rows via boolean
operations on columns.
Both methods should only be used for operations on very large data frames. For small data frames they are significantly
slower than standard Python. Have a look at Expression evaluation via eval()266 in Pandas' user guide for details.
262 https://pandas.pydata.org/docs/user_guide/text.html#method-summary
263 https://pandas.pydata.org/docs/user_guide/basics.html#dt-accessor
264 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html
265 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
266 https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#expression-evaluation-via-eval
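A sketch of both interfaces (column names and data are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5, 10)})

# evaluate an arithmetic expression given as a string
c = pd.eval('df.a + 2 * df.b')

# select rows via a boolean expression on columns
rows = df.query('a > 1 and b < 9')
```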
Sometimes data sets are too large to be loaded into memory as a whole. Pandas supports partial loading, and there
are other Pandas-like Python libraries supporting data sets larger than memory.
Partial Loading
The pd.read_csv267 function supports chunking, that is, loading data in chunks. After processing a chunk it gets
removed from memory and the next chunk can be read into memory. See Iterating through files chunk by chunk268 in
Pandas' user guide.
Other Libraries
Dask269 is a parallel computing library with a Pandas-like API. It allows for faster processing of large data sets. Have
a look at Use other libraries270 in Pandas' user guide for a quick introduction.
267 https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
268 https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#iterating-through-files-chunk-by-chunk
269 https://www.dask.org/
270 https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#use-other-libraries
Exercises
19 Computer Basics
To solve these exercises on bits and bytes and representations of numbers you should have read Computers and
Programming (page 33).
A hard disk is advertised to have a capacity of 2 TB. Give the hard disk’s capacity in TiB.
Solution
# your solution
An image has a width of 4000 pixels and a height of 3000 pixels. For each pixel three color components (red, green,
blue) have to be saved, where 8 bits per component are required. How much space does the image need when saving
it to a file without compression?
Solution
# your solution
19.1.3 Books
A book with 500 pages and 1500 characters per page on average shall be saved to a file. Each character requires one
byte of disk space. What is the file's total size?
A well-known online encyclopedia has about 50 GB of English text (without images). How many books are needed
for a print version?
How long does it take to transfer 50 GB of data with a transfer rate of 100 megabit per second?
Solution
# your solution
Data Science and Artificial Intelligence for Undergraduates
How much storage is required for an uncompressed 120-minute video file with 30 frames (that is, images) per second
and full HD resolution (1920x1080 pixels, 24 bits per pixel)?
What's the compression rate if the video file has a size of 4.1 GB?
Solution
# your solution
Write the 32-digit binary number 10001001 11111110 00001010 01001101 as a hexadecimal number.
Solution
# your solution
Give two differences between a computer’s memory and a mass storage device.
Solution
# your solution
20 Python Programming
Computer programs contain lots of errors. Finding them is much more difficult than correcting them. In this series
of exercises you have to find syntax and semantic errors in small programs. Before you start you should have read
Building Blocks (page 46).
Hint: Simply run the programs and let the Python interpreter look for errors. Then correct all errors identified by
the interpreter. If still something is not working as expected, have a closer look at the source code.
20.1.1 Simple 1
Solution:
# your modifications
243
Data Science and Artificial Intelligence for Undergraduates
20.1.2 Simple 2
a = 2
b = 3
if a < b
print(a, '<', b)
Solution:
# your modifications
a = 2
b = 3
if a < b
print(a, '<', b)
20.1.3 Simple 3
def say_something(text):
say_samething('Python is great!')
Solution:
# your modifications
def say_something(text):
say_samething('Python is great!')
20.1.4 Simple 4
a = 2
b = 3
if a = b:
print(a, 'equals', b)
else:
print('not equal')
Solution:
# your modifications
a = 2
b = 3
Solution:
# your modifications
Solution:
# your modifications
my_list = [1, 3, 4, 2, 6]
Solution:
# your modifications
my_list = [1, 3, 4, 2, 6]
last = 5
Solution:
# your modifications
last = 5
Solution:
# your modifications
l = [2, 3, 2, 5, 7, 8, 9]
Solution:
# your modifications
l = [2, 3, 2, 5, 7, 8, 9]
l = [2, 4, 6, 3]
Solution:
# your modifications
l = [2, 4, 6, 3]
20.1.12 Difficult 1
total_sum = 0
for i in range(0, len(my_lists)):
for j in range(0, len(my_lists[0])):
total_sum = total_sum + my_lists[j][i]
print(total_sum)
Solution:
# your modifications
total_sum = 0
for i in range(0, len(my_lists)):
for j in range(0, len(my_lists[0])):
total_sum = total_sum + my_lists[j][i]
print(total_sum)
20.1.13 Difficult 2
last = 19
Solution:
# your modifications
last = 19
20.1.14 Difficult 3
def print_first_half(l):
'''print first half of list, include center item if length of list is odd'''
def print_second_half(l):
'''print second half of list, omit center item if length of list is odd'''
l = [2, 4, 3, 6, 8, 5, 7]
print_first_half(l)
print_second_half(l)
l = [2, 4, 3, 6, 8, 5]
print_first_half(l)
print_second_half(l)
l = [1]
print_first_half(l)
print_second_half(l)
l = []
print_first_half(l)
print_second_half(l)
Solution:
# your modifications
def print_first_half(l):
'''print first half of list, include center item if length of list is odd'''
def print_second_half(l):
'''print second half of list, omit center item if length of list is odd'''
l = [2, 4, 3, 6, 8, 5, 7]
print_first_half(l)
print_second_half(l)
l = [2, 4, 3, 6, 8, 5]
print_first_half(l)
print_second_half(l)
l = [1]
print_first_half(l)
print_second_half(l)
l = []
print_first_half(l)
print_second_half(l)
20.2 Basics
Before solving these programming exercises you should have read Building Blocks (page 46). Only use Python features
discussed there.
If light travels 300 million meters per second, how many kilometers does light travel in one hour?
Solution:
# your solution
A 20-year-old person started watching web videos at the age of 8. Every day the person watches videos for 2 hours.
If the person had watched videos for the same total number of hours, but all day long instead of only 2 hours per day,
how many years of his or her young life would the person have wasted?
Take into account that the person needs approximately 7 hours for sleeping and 3 hours for eating, grooming, and doing
housework. Thus, the time available for watching videos is less than 24 hours a day.
Assume that each year has 365 days.
Solution:
# your solution
Get two integers from the user. Divide the first by the second. Use floor division and show the remainder, too. Avoid
ZeroDivisionError.
Solution:
# your solution
Get three integers from the user and print a message if they are not in ascending order.
Solution:
# your solution
Get three integers from the user and print them in ascending order.
Solution:
# your solution
20.2.6 Functions 1
Write a function is_ascending which checks whether its three numeric arguments are in ascending order. Return
a boolean value.
Test the function with an ascending sequence and a non-ascending sequence.
Solution:
# your solution
20.2.7 Functions 2
Write a function cut_off_decimals which takes two arguments, a float and a positive integer. The function shall
cut off all but the specified number of decimal places of the float and then return the float. If the second argument is
negative, the float shall remain untouched.
Note that floor division also works for floats. Thus, x // 1 cuts off all decimal places.
Test the function with 1.2345 and 2 decimal places.
Solution:
# your solution
20.2.8 Loops 1
Print the numbers 0 to 9 with a while loop. In other words: Use a while loop to simulate a for loop.
Solution:
# your solution
20.2.9 Loops 2
Take a list and print its items in reverse order. Test your code with [-1, 2, -3, 4, -5].
Solution:
# your solution
20.2.10 Loops 3
Take a list and print it item by item. After each item ask the user whether he wants to see the next item. If not, stop
printing. If yes, go on.
Solution:
# your solution
20.2.11 Loops 4
Take a list and print it item by item. After each item ask the user whether he wants to see the next item. If not, stop
printing. If yes, go on. If there are no more items left, start again with the first item.
Solution:
# your solution
Solving this set of exercises increases your skills in algorithmic thinking and Python’s syntax. Everything you need
has been discussed in the Crash Course (page 43) chapter. Do not use additional Python features or modules.
Get two integers from the user and check whether the corresponding point lies inside the rectangle with corners at
(-1, -1), (5, -1), (5, 2), (-1, 2). Print a message showing the result.
Solution:
# your solution
Get an integer from the user and tell the user whether it’s a square number or not. If you want to compute square
roots, use 123 ** 0.5. Print a message if the user gave a negative number.
Hint: Have a look at the output of 16.125 % 1.
Solution:
# your solution
Write a function no_duplicates which returns True if the passed list contains no duplicates and False if
there are duplicates.
Test your function with [1, 4, 5, 6, 3] and [1, 3, 1] (and with [], of course).
Solution:
# your solution
Write a function inc_subseq which takes a list of numbers and prints all items except the ones which are smaller
than their predecessor.
Test your function (at least) with [1, 3, 2, 3, 4, -2, 9] and [3, 2, 1].
Solution:
# your solution
Get an integer radius of a circle from the user. Calculate and print the circle’s area as well as the edge length of a
square with identical area. Check user input for validity. You may use NumPy’s pi constant.
Solution:
# your solution
Solve the quadratic equation a·x² + b·x + c = 0 with user-specified a, b, c (integers). Give all real solutions.
Solution:
# your solution
Use Matplotlib and NumPy’s sin and cos functions to plot a regular polygon. Ask the user for the number of
vertices and check user input for validity.
Hint: For n vertices the kth vertex is at (cos φ, sin φ) with φ = 2πk/n.
Solution:
# your solution
20.3.8 Stars
Draw a star with a user-specified number of outer vertices. The radius for inner vertices is 0.3, for outer vertices it's 1.
Solution:
# your solution
Python’s approach to variables and operators is simple and beautiful, although beginners need some time to see both
simplicity and beauty. In each exercise below keep track of what’s happening in detail behind the scenes. That is,
track the creation of objects and which name is tied to which object. Before solving the exercises read the chapter on
Variables and Operators (page 75) (sections Operators as Member Functions (page 88) and Efficiency (page 89) are
not required here).
The following code contains an error. Find it (by running the code) and correct it.
def do_something():
n = n + 1
print('something')
Solution:
# your modifications
def do_something():
n = n + 1
print('something')
The following code shall print 10 lines each containing 5 numbers. Make it work correctly.
Solution:
# your modifications
a = [1, 2, 3, 4, 5]
squares = a
for i in range(0, len(squares)):
squares[i] **= 2
Solution:
# your modifications
a = [1, 2, 3, 4, 5]
squares = a
for i in range(0, len(squares)):
squares[i] **= 2
Why do the following two code cells yield different results? Explain in detail what’s happening!
a = 2
b = a
a = 5
print(b)
a = [2]
b = a
a[0] = 5
print(b[0])
Solution:
# your answer
20.4.5 2 > 3?
Guess why the condition 2 <= 3 == True evaluates to False. How can it be repaired?
if 2 <= 3 == True:
print('2 <= 3')
else:
print('2 > 3, really?')
Solution:
# your modifications
if 2 <= 3 == True:
print('2 <= 3')
else:
print('2 > 3, really?')
2 > 3, really?
a = 5
b = 2
c = 3
print(a - b - c)
a = 5
b = 2
c = 3
a -= b - c
print(a)
Solution:
# your answer
Here you find some exercises on Python’s optimization techniques for memory management discussed in Efficiency
(page 89).
The last two tasks demonstrate effects of running out of memory and how to prevent such situations. Read Garbage
Collection (page 91) first.
a = 100
b = 2 * a
c = 2 * a
print(b is c)
True
a = 100
b = 3 * a
c = 3 * a
print(b is c)
False
Solution:
# your answer
Copy the following code to a text file and feed the file to the Python interpreter. Why do both variants yield different
results? Why does running the code in Jupyter yield identical results in both cases?
# variant 1
a = 1234
b = 1234
print(a is b)
# variant 2
c = 34
a = 1200 + c
b = 1200 + c
print(a is b)
False
False
Solution:
# your answer
Copy the following code to a text file and feed the file to the Python interpreter. Do you have an idea why the next
code yields True?
a = 1200 + 34
b = 1200 + 34
print(a is b)
False
Solution:
# your answer
Run the following code and observe what happens to your system. Use some system tool to monitor memory con-
sumption. You may interrupt the Python kernel in Jupyter if necessary.
Warning: Save all data you are currently editing. Depending on your system you may have to reboot your
machine due to the system hanging.
stop = False
data = 'x'
while not stop:  # the while loop and the input line are reconstructed
    fac = input('factor (empty string to stop): ')
    if fac == '':
        stop = True
    else:
        fac = int(fac)
        print('increasing memory usage to', fac * len(data), 'bytes...')
        new_data = ''
        for i in range(0, fac):
            new_data += data
        data = new_data
        del new_data
        print('...done')
del data
Note the del data line. This line is not required if the program is run as a stand-alone program (text file fed to the
interpreter), because at exit the interpreter will free all memory used by the program. But in Jupyter the interpreter
does not stop at the end of a code cell. All objects created by the cell remain in memory until the kernel dies or
memory is explicitly freed via del.
Solution:
# your answer
Write a function which returns a string with approximately 1 billion characters. Then write a program which can
manage three data sets. Repeatedly ask the user whether he or she wants to exit the program or whether he or she
wants to load/remove data set 1/2/3. Load and remove data sets according to the user’s choice. Monitor memory
consumption while running the program to see whether the program works as intended.
Solution:
# your solution
To solve these exercises you should have read Lists and Friends (page 93). Only use features discussed there or in
previous chapters.
Consider the following code snippet. What numbers appear on screen when printing a, b, e, g, and h after executing
the code? Don't run the code, interpret each line manually.
a = 1
b = 2
c = [a, b]
c[0] = 3
d = c
d[1] = 4
c.append(5)
e = c[-1]
f = c[0:-1]
g = f[-1]
h = d[0]
Solution:
# your answer
The following code yields incorrect outputs. Find the problem and solve it.
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
d = [a, b, c]
# compute squares
for i in range(0, 3):
for j in range(0, 3):
d[i][j] **= 2
# print results
print('squares of d:', d)
print('sum of a:', sum_a)
print('sum of b:', sum_b)
print('sum of c:', sum_c)
Solution:
# your modifications
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
d = [a, b, c]
# compute squares
for i in range(0, 3):
for j in range(0, 3):
d[i][j] **= 2
# print results
print('squares of d:', d)
print('sum of a:', sum_a)
print('sum of b:', sum_b)
print('sum of c:', sum_c)
Write a function shift which takes a list, increases each item's index by one (last item becomes first one), and
returns the resulting list. Example: [1, 3, 5, 7] should become [7, 1, 3, 5]. Don't use loops, but slicing
syntax.
Solution:
# your solution
Write a function shift_n which takes a list and a positive integer n and shifts the list n times (cf. previous task).
Don’t use loops. Do not forget to think about the case that n is larger than the length of the list. Examples:
• shift_n([1, 2, 3, 4, 5], 3) should be [3, 4, 5, 1, 2].
• shift_n([1, 2, 3, 4, 5], 5) should be [1, 2, 3, 4, 5].
• shift_n([1, 2, 3, 4, 5], 6) should be [5, 1, 2, 3, 4].
Solution:
# your solution
Given two lists keys and values, create a dictionary. Use a for loop to fill an empty dictionary item by item. Test
case:
Solution:
# your solution
Given two lists keys and values, create a dictionary. Use a dictionary comprehension. Test case:
Solution:
# your solution
Given two lists keys and values, create a dictionary. Call dict and pass a list of key-value pairs. Test case:
Solution:
# your solution
20.6.8 One-Liner 1
Given a list of numbers, write one line of code to create a new list containing only numbers greater than 3. Test case:
[4, 3, 2, 8, 6, 0, 4, 6, -2, 1]
Solution:
# your solution
20.6.9 One-Liner 2
In one line of code create a new list containing every second number from a given list, if the number is between 1
and 10 (both included). Test case:
Solution:
# your solution
20.6.10 One-Liner 3
Write one line of code to square all numbers in a list of lists of numbers. Test case:
Solution:
# your solution
Make a copy of a list of lists of numbers. In the end, changing a number in the copy must not modify the original
numbers. Test case:
Solution:
# your solution
20.7 Strings
Read Strings (page 105) before you start with the exercises.
Print a list of all characters appearing in a string. Count how often each character appears without using
str.count. Get the string from user input.
Solution:
# your solution
Print the string I like apples and melons., where the words apples and melons shall be replaced by
suitable Unicode symbols.
Solution:
# your solution
20.7.3 Parser
Write a function which converts a string representation of a table of integers to a list of lists of integers. The elements
of a row are separated by commas and rows are separated by semicolons. Test your function with
'1,2,3;4,5,6;7,8,9'
Result should be
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Solution:
# your solution
Write a Python script which prompts the user for some input and translates the input to spoon language271.
Hint: Simple replacement does not work. If, for instance, a is first replaced by alewa, subsequent replacement of e
will alter the lew in alewa. There seems to be no replacement order for vowels that works correctly in every case.
Thus, in a first pass replace all vowels by some code (unlikely to appear in a text); then, in a second pass, replace the
codes by the vowel's lew-version.
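A minimal sketch of the two-pass idea from the hint, for lower-case vowels only (the placeholder format is an arbitrary choice, and this is not the full program with user input):

```python
VOWELS = 'aeiou'

def to_spoon(text):
    # pass 1: replace each vowel by a placeholder unlikely to occur in text
    for i, v in enumerate(VOWELS):
        text = text.replace(v, f'\0{i}\0')
    # pass 2: replace each placeholder by the vowel's lew-version
    for i, v in enumerate(VOWELS):
        text = text.replace(f'\0{i}\0', f'{v}lew{v}')
    return text

print(to_spoon('banana'))   # balewanalewanalewa
```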
Solution:
# your solution
271 https://de.wikipedia.org/wiki/Spielsprache#L%C3%B6ffelsprache
Read Accessing Data (page 111) before you start with the exercises.
Read some text file’s content, convert it to lower case and save it to a new text file.
Solution:
# your solution
Get a CSV file containing all public trees in Chemnitz from the open data portal of Chemnitz272. Read the first 10
lines from the file and show them on screen.
Hint: If you encounter cumbersome symbols in the output, have a look at byte order marks273 at Wikipedia.
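A sketch of the BOM effect mentioned in the hint (the file name is arbitrary): Python's 'utf-8-sig' codec strips a leading byte order mark on reading, while plain 'utf-8' keeps it as part of the text.

```python
# write a small file including a BOM, as many Windows tools do
with open('bom_demo.csv', 'w', encoding='utf-8-sig') as f:
    f.write('id;name\n1;oak\n')

with open('bom_demo.csv', encoding='utf-8') as f:
    print(repr(f.readline()))      # first character is '\ufeff' (the BOM)

with open('bom_demo.csv', encoding='utf-8-sig') as f:
    print(repr(f.readline()))      # BOM stripped transparently
```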
Solution:
# your solution
Get 'The Blog Authorship Corpus' from the web. The original source https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
vanished in 2022. Use https://www.fh-zwickau.de/~jef19jdw/datasets/blogs.zip instead. The ZIP file contains an
info.txt file and the original ZIP file.
Read info.txt to get information about the file name format used in the ZIP file.
Write a Python program which extracts all 5 features from the file names and saves them to a CSV file.
Solution:
# your solution
Open the file 7596.male.26.Internet.Scorpio.xml from ‘The Blog Authorship Corpus’ (see exercise
above) without extracting it explicitly. Print the first and the last post in the file to screen.
Solution:
# your solution
272 http://portal-chemnitz.opendata.arcgis.com
273 https://en.wikipedia.org/wiki/Byte_order_mark
20.9 Functions
Read Functions (page 127) before you start with the exercises.
Write a function which squares all numbers in a list. The list of numbers is the only parameter and there is no return
value. The results have to be stored in the original list.
Test your code with
[1, 2, 3, 4, 5]
Solution:
# your solution
Write a function taking an arbitrary number of keyword arguments and printing all of them.
Test your code by passing
a: 1
b: test
c: (1, 2, 3)
Solution:
# your solution
# your solution
20.9.4 Composition
# your solution
20.9.5 Sorting
Sort a list of paths by file name. Use list.sort with a custom sort key.
Test your code with
['/some/path/file.txt',
'/another/path/xfile.txt',
'file_without_path.xyz',
'../relative/path/abc.py',
'/no/extension/some_file.txt']
Result should be
['../relative/path/abc.py',
'/some/path/file.txt',
'file_without_path.xyz',
'/no/extension/some_file.txt',
'/another/path/xfile.txt']
Solution:
# your solution
Given an integer 𝑛 calculate 𝑛! (see Combinatorics (page 331) for a definition) using a loop. Then calculate 𝑛! by
exploiting the recursive rule 𝑛! = 𝑛 ⋅ (𝑛 − 1)!. Test both functions with 10! = 3628800.
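Both variants can be sketched as follows; treat this as one possible structure, not the reference solution:

```python
def factorial_loop(n):
    # iterative: multiply 1 * 2 * ... * n
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def factorial_recursive(n):
    # recursive: n! = n * (n-1)!, with 0! = 1! = 1
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

print(factorial_loop(10), factorial_recursive(10))   # 3628800 3628800
```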
Solution:
# your solution
Read Inheritance (page 143) before you start with the exercises.
20.10.1 Animals
Write a class Animal whose constructor takes the number of legs and the weight as arguments. Implement a method
say_something (an abstract animal says nothing) and a method show_up which prints the animal as ASCII
art274:
(########):
//||||\\
Bonus: Every time an Animal object is created, a message 'An animal with … legs is hiding somewhere.'
shall be printed.
Test your code by creating and showing animals with 0, 1,…, 8 legs.
Solution:
# your solution
20.10.2 Dogs
Derive a class Dog from the Animal class (cf. exercise above). Implement a proper say_something method
(‘Wau’, for instance) and reimplement show_up to show a dog instead of an abstract animal.
Note that the constructor only takes the dog's weight as argument. The constructor should call the base class'
constructor to set the correct number of legs (and show the bonus message, if implemented).
In your test code also check whether the dog is lightweight.
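The mechanism of calling a base class constructor can be sketched as follows (the class and attribute names here are generic placeholders, not the exercise's Animal and Dog classes):

```python
class Base:
    def __init__(self, legs):
        self.legs = legs

class Derived(Base):
    def __init__(self, weight):
        # delegate to the base class constructor to set inherited attributes
        super().__init__(legs=4)
        self.weight = weight

d = Derived(weight=12.5)
print(d.legs, d.weight)   # 4 12.5
```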
Solution:
# your solution
274 https://en.wikipedia.org/wiki/ASCII_art
Add methods sit_down and stand_up to your Dog class. Depending on which of the two methods was called
last, show_up shall show a sitting or a standing dog.
Solution:
# your solution
20.10.4 Fish
Derive a Fish class from Animal. Add a member function to_sticks. After calling this function once,
show_up should show 10 fish sticks per kilogram of the fish instead of a live fish.
Don't forget to implement say_something (maybe 'Blub' or, more realistically, '').
Solution:
# your solution
21 Managing Data
Before solving these basic NumPy exercises you should have read NumPy Arrays (page 155), Array Operations
(page 161), Advanced Indexing (page 165), and Vectorization (page 166). Only use NumPy features discussed there.
import numpy as np
21.1.1 Maximum
Given two arrays print the largest value of all elements from both arrays. Use np.max or np.maximum or both.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and
[1, 2, 3]
Solution:
# your solution
Print the largest elementwise absolute difference between two equally shaped arrays.
Test your code with
[[1.3, 2.4],
 [3.5, 6.5]]
and
[[1.25, 2.34],
 [3.499, 6.55]]
Solution:
# your solution
Data Science and Artificial Intelligence for Undergraduates
21.1.3 Comparisons 1
Given an array of integers, test whether there is a one in the array. Print a message showing the result.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and also with
[[0, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
Solution:
# your solution
21.1.4 Comparisons 2
Test whether all elements of an array are positive. Print a message showing the result.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and also with
[[1, -2, 3],
 [-4, 5, 6],
 [7, 8, 9]]
Solution:
# your solution
21.1.5 Indexing 1
Set all elements between -1 and 1 to zero. Remember: Avoid loops wherever possible.
Test your code with
[[-2, -0.5, 0],
 [0.4, 5, 0.9],
 [1, 8, 9]]
Solution:
# your solution
21.1.6 Indexing 2
Write a function which takes an integer 𝑛 and returns an 𝑛 × 𝑛 matrix with ones at the borders and zeros inside. Data
type of the matrix should be int8.
Example for 𝑛 = 5:
[[1, 1, 1, 1, 1],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [1, 1, 1, 1, 1]]
Solution:
# your solution
21.1.7 Indexing 3
Write a function which takes an arbitrarily sized matrix and returns a matrix having the same elements as the original
matrix, but being bordered by additional zeros. The returned matrix should have the same data type as the original
matrix.
Example:
input:
[[1, 2, 3],
 [4, 5, 6]]
output:
[[0, 0, 0, 0, 0],
 [0, 1, 2, 3, 0],
 [0, 4, 5, 6, 0],
 [0, 0, 0, 0, 0]]
Solution:
# your solution
21.1.8 Broadcasting
Take the first row of a matrix and add it to all other rows of the matrix. Print the resulting matrix.
Test your code with
[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 10, 11, 12]]
Solution:
# your solution
Given a square matrix with an odd number of rows/columns, replace all but the boundary elements by zeros and write
the mean of all boundary elements to the center element. Print the modified matrix.
Example:
original matrix:
[[-1, 2, -1, 2, -1],
 [2, -2, 3, -2, 2],
 [-1, 3, -3, 3, -1],
 [2, -2, 3, -2, 2],
 [-1, 2, -1, 2, -1]]
result:
[[-1, 2, -1, 2, -1],
 [2, 0, 0, 0, 2],
 [-1, 0, 0.5, 0, -1],
 [2, 0, 0, 0, 2],
 [-1, 2, -1, 2, -1]]
Solution:
# your solution
Images can be represented by NumPy arrays with two (grayscale) or three (color) dimensions. Thus, basic image
processing like color transforms and cropping reduces to operations on NumPy arrays. Before solving the exercises
you should have read Efficient Computations with NumPy (page 155). Only use NumPy features, no additional
modules.
To show results (images) on screen use Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
def show_image(img):
    ''' Show 2d or 3d NumPy array img as image (color range 0...1). '''
    # check range
    if img.min() < 0 or img.max() > 1:
        print('Color values out of range!')
        return
    # check dims
    if img.ndim == 2:  # disable color normalization for grayscale
        plt.imshow(img, vmin=0, vmax=1, cmap='gray')
    elif img.ndim == 3:
        plt.imshow(img)
    else:
        print('Wrong number of dimensions!')
        return
    plt.show()
Hint: Start with the first exercise and complete exercises one by one. Each exercise will use results from the previous
one.
Load all images (NumPy arrays) contained in pasta.npz. The file contains three color images. Name the arrays
img1, img2, img3.
Print the images’ shapes and show the images with show_image. What’s the numeric range of the pixel values
(floats 0…1 or integers 0…255)?
Solution:
# your solution
Write a function rgb2gray which takes a color image (3d array) and returns a grayscale image (2d array). Calculate
gray levels as pixelwise mean of red, green and blue values.
Apply the function to the three images and show results.
Hint: NumPy has a mean function. The axis parameter could be of interest.
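A sketch of how the axis parameter collapses the color dimension (the tiny array is made up):

```python
import numpy as np

# shape (1, 2, 3): one row, two pixels, three color channels
img = np.array([[[0.0, 0.5, 1.0],
                 [0.3, 0.3, 0.3]]])
# averaging over the last axis turns h x w x 3 into h x w
gray = img.mean(axis=-1)
print(gray.shape)   # (1, 2)
```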
Solution:
# your solution
21.2.3 Cropping
Each image shows a piece of alphabet pasta on a perfect black background (color value 0). Write a function
auto_crop which finds the bounding box of a grayscale image and returns a copy (not a view!) of the
corresponding subarray. The bounding box is the rectangle that contains the pasta piece without black margin.
Apply the function to the grayscale pasta images and show the results.
Solution:
# your solution
21.2.4 Centering
Write a function center which takes a grayscale image and an integer n. The return value shall be a grayscale
image of size n x n with black background and the passed image positioned at the new image's center (identical
margin width on all sides).
Place each cropped pasta image in a 50x50 image and show the results.
Solution:
# your solution
Implement the above auto_crop function without using loops, that is, fully vectorized. Compare execution times.
Solution:
# your solution
Before solving these basic Pandas exercises you should have read Series (page 188) and Data Frames (page 198).
For these exercises we use a dataset describing used cars obtained from kaggle.com275. Licenses: Open Data
Commons Database Contents License (DbCL) v1.0276 and Open Data Commons Open Database License (ODbL)277.
import pandas as pd
data = pd.read_csv('cars.csv')
275 https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
276 http://opendatacommons.org/licenses/dbcl/1.0/
277 https://opendatacommons.org/licenses/odbl/summary/
Basic Information
# your solution
Missing Values
# your answer
Value Counts
# your solution
# your solution
New Columns
Append a column 'manual_trans' containing True where column 'transmission' shows 'Manual',
else False.
Append a column 'age' showing a car’s age (now minus 'year').
Solution:
# your solution
278 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html
279 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
Remove Columns
# your solution
Create a Pandas series price with column 'name' as index and column 'selling_price' as data.
Solution:
# your solution
Mean
# your solution
Boolean Indexing
Use boolean row indexing to get a data frame one_model with columns 'km_driven' and 'age' containing
only rows with 'name' equal to 'Maruti Swift Dzire VDI'.
Solution:
# your solution
New Column
Add a column 'km_per_year' to the one_model data frame containing kilometers per year.
Solution:
# your solution
Mean
# your solution
Find the oldest car in data and print its name and manufacturing year. Have a look at Pandas’ documentation280 for
suitable functions.
Solution:
# your solution
Before solving these exercises you should have read Advanced Indexing (page 207) and Dates and Times (page 217).
import pandas as pd
21.4.1 Cars
For these exercises we use a dataset describing used cars obtained from kaggle.com281. Licenses: Open Data
Commons Database Contents License (DbCL) v1.0282 and Open Data Commons Open Database License (ODbL)283.
data = pd.read_csv('cars.csv')
Create a multi-level index for the data frame from columns 'name' and 'year'.
Solution:
# your solution
280 https://pandas.pydata.org/docs/
281 https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
282 http://opendatacommons.org/licenses/dbcl/1.0/
283 https://opendatacommons.org/licenses/odbl/summary/
Select Model
Print all rows for the 'Maruti Swift Dzire VDI' 2018 model.
Solution:
# your solution
Diesel
Select all 2018 cars and use value_counts284 to get the percentage of Diesel cars.
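A sketch of the value_counts behavior assumed here, on made-up data:

```python
import pandas as pd

fuel = pd.Series(['Diesel', 'Petrol', 'Diesel', 'Diesel'])
# normalize=True returns relative frequencies instead of raw counts
print(fuel.value_counts(normalize=True)['Diesel'])   # 0.75
```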
Solution:
# your solution
Old Cars
Print all cars with more than 100000 kilometers driven and manufactured before 2000.
Solution:
# your solution
21.4.2 E-Mails
Consider an email account receiving emails every day. Use the following code to generate a list times of time
stamps representing arrival times of emails.
import numpy as np
rng = np.random.default_rng(0)
n_mails = 1000
start_time = pd.Timestamp('2019-01-01 00:00:00')
end_time = pd.Timestamp('2020-01-01 00:00:00')
Given the list of time stamps of incoming mails create a series with daily mail counts.
Solution:
# your solution
284 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
Every day the user answers only those mails received up to 7:00am that day. From the list of time stamps create a
series with daily mail counts at 7:00am. Hint: Have a look at the offset argument of Series.resample285;
label might be of interest, too.
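A sketch of the resample mechanics on a tiny hand-made series (parameter names as in pandas ≥ 1.1; the time stamps are made up):

```python
import pandas as pd

times = pd.to_datetime(['2019-01-01 06:00', '2019-01-01 08:00',
                        '2019-01-02 06:59'])
mails = pd.Series(1, index=times)
# daily bins shifted to end at 7:00; label='right' stamps each bin
# with its end time
counts = mails.resample('D', offset='7h', label='right').sum()
print(counts)
```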
Solution:
# your solution
Assume the user reads and answers emails on business days only (again, at 7:00am). Create a series containing the
numbers of mails to process on each business day.
Solution:
# your solution
Vacation
From the results of the previous task get the number of mails arriving during winter vacation in January and February.
Use a variable for the year of interest:
year = 2019
Write code which works for all years (leap year or not).
Solution:
# your solution
Before solving these exercises you should have read Advanced Indexing (page 207), Dates and Times (page 217),
Categorical Data (page 223), and Restructuring Data (page 227).
import pandas as pd
import numpy as np
21.5.1 Grades
Use the following code to create a series containing student IDs as index and points in exam as data:
rng = np.random.default_rng(123)
n_students = 20
max_points = 40
exam_points
What do the two points = ... lines in the above code do in detail?
Solution:
# your answer
Points to Grades
Add a column to the series (resulting in a data frame) containing corresponding grades. Conversion from points to
grade is as follows:
points grade
id
20077 24 3.3
23411 11 5.0
22964 33 2.3
...
# your solution
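The pandas.cut function cited in the footnote maps numeric values to labeled bins; a minimal sketch (the bin edges and grade labels here are made up, not the exercise's grading scheme):

```python
import pandas as pd

points = pd.Series([24, 11, 33])
# bins are half-open intervals (0, 15], (15, 25], (25, 40]
grades = pd.cut(points, bins=[0, 15, 25, 40], labels=['5.0', '3.0', '1.0'])
print(list(grades))   # ['3.0', '5.0', '1.0']
```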
286 https://pandas.pydata.org/docs/reference/api/pandas.cut.html
Mean Grade
Get the mean grade for all students who passed the exam (grade better than 5).
Solution:
# your solution
21.5.2 Cafeteria
For these exercises we use the dataset obtained in the Cafeteria (page 315) project.
data
Dates
Convert 'date' column to Timestamp. Hint: the pd.to_datetime287 function is very flexible.
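A minimal sketch of pd.to_datetime on a column of date strings (the sample dates are made up):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2021-03-01', '2021-03-02']})
# ISO-formatted dates are parsed without extra arguments
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)   # datetime64[ns]
```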
Solution:
# your solution
Categories
# your solution
Get mean students/staff/guests prices per category. Sort results by students price.
Solution:
# your solution
Drop rows with nan or 0.0 prices. Then get minimum, average, and maximum students prices per day. Create a data
frame with three columns 'min', 'mean', 'max' and a DatetimeIndex. Call the data frame's plot288 method
(works without arguments) to visualize the results.
Solution:
# your solution
287 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
288 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
Restructuring
Create a data frame showing prices only, no meal names. Use dates for row index. Column index shall be multi-level
with first level showing the category and second level showing the price level (students/staff/guests).
Solution:
# your solution
Before solving these exercises you should have read High-Level Data Management with Pandas (page 187).
import pandas as pd
For these exercises we use the dataset obtained in the Cafeteria (page 315) project.
data
21.6.1 Preprocessing
Data Types
Convert 'date' and 'category' columns to Timestamp and category, respectively (see Advanced Pandas
(page 278) exercises).
Solution:
# your solution
Count Categories
# your solution
Remove Categories
Remove categories which do not contain full-fledged meal descriptions (e.g., 'Salat Bar').
Solution:
# your solution
Most meals contain information on allergens and additives (numbers in parentheses). Remove this information to
get more readable meal descriptions. Implement the removal procedure twice: without and with vectorized string
operations. Measure and compare execution times.
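A sketch of a vectorized replacement with Series.str.replace; the regex is an assumption about how the additive numbers look, and the sample meal names are made up:

```python
import pandas as pd

names = pd.Series(['Gulasch (2,3,8) mit Klößen', 'Salat'])
# remove optional whitespace plus a parenthesized group of digits and commas
cleaned = names.str.replace(r'\s*\([0-9,]+\)', '', regex=True)
print(cleaned.tolist())   # ['Gulasch mit Klößen', 'Salat']
```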
Solution:
data['name_backup'] = data['name']
%%timeit
# your solution without vectorized string operations
data['name'] = data['name_backup']
data = data.drop(columns=['name_backup'])
%%timeit
# your solution with vectorized string operations
Create a new column 'simple' from the 'name' column by removing all lower-case words, all punctuation marks
and so on. Only words starting with an upper-case letter are allowed.
Solution:
# your solution
Given a keyword (e.g., 'Kartoffel') get the number of meals containing the keyword and print all meal
descriptions.
Solution:
# your solution
For each day get the number of meals containing some keyword (e.g., 'Kartoffel'). Call Series.plot to
visualize the result.
Solution:
# your solution
Projects
Chapter 22
There are lots of different ways to install and use Python and related tools. Choose the one which suits your needs
and your way of working.
• Working with JupyterLab (page 285)
• Install Jupyter Locally (page 289)
• Python Without Jupyter (page 292)
• Long-Running Tasks (page 295)
22.1 Working with JupyterLab
In this project you learn basic usage of JupyterLab289, a web interface for Python programming. Read Python and
Jupyter (page 27) to get some information on the relation between Python and JupyterLab.
This project is available as video (with German audio and subtitles only):
Note: JupyterLab is not restricted to Python programming, but supports many other programming languages as well.
Task: Open a web browser and go to some JupyterLab provider. Students of Zwickau University should go to
Gauss290.
Hint: You may also run a local JupyterLab instance. See Install Jupyter Locally (page 289) project for installation
and start.
After starting JupyterLab you should see a file manager sidebar on the left and the launcher tab in the working area
on the right.
Task: Create a new notebook file by clicking the Python button in the launcher tab’s Notebook section.
This creates a file Untitled.ipynb in the current directory. To change the file name click ‘File’ and ‘Rename
Notebook…’ in the top menu.
289 https://jupyter.org
290 https://gauss.fh-zwickau.de
Fig. 22.1: File manager and launcher tab are shown in JupyterLab’s initial view.
a = 2
b = 3
print(a + b)
Fig. 22.2: Your notebook should look as depicted here if you’ve completed all tasks above.
22.1.3 Kernels
JupyterLab is the connection between you and Python. The code you write in a code cell is sent to Python for
execution. JupyterLab watches for outputs of your program and displays them to you. The background Python part
is referred to as the Python kernel.
The connection between a notebook in JupyterLab and the Python kernel is very loose. You may open a notebook
file without running a kernel. Then code execution isn’t possible. Or you may have a running Python kernel although
you closed your notebook.
Task: Close your notebook file’s tab. Then click the square symbol in the sidebar.
Fig. 22.3: The square button brings up a list of running kernels and some other information.
The Python kernel of our test notebook is still running. Clicking the kernel line reopens the notebook tab. The X
button (only shown on hover) shuts the kernel down.
Task: Shut down the kernel.
Hint: Instead of closing a tab and its kernel separately, you may also click 'File' and 'Close and Shutdown Notebook' in the menu.
Task: Switch to the file manager in the sidebar and open your notebook (double click its name).
Task: Execute the first (that is, topmost) cell.
Python complains about an unknown name. The fresh kernel hasn’t seen the a = 2 line up to now, because we did
not execute it. Executing the second cell makes the first work correctly.
Important: Always try to create notebooks which can be executed in the same order as cells appear in the notebook!
Hint: If your code takes too long to run or if it won’t stop for some reason, click ‘Kernel’ and ‘Interrupt Kernel’ in
the menu. This stops code execution and makes the kernel wait for new code.
Hint: To test your notebook’s behavior after launching a fresh kernel but without reopening the file, click ‘Kernel’
and ‘Restart Kernel…’ in the menu.
Up to now we only used code cells. Another important cell type is the Markdown cell. Markdown is a markup
language for writing formatted text.
# Some Heading
## A Subsection
Task: Create a new cell (ESC, then A or B to insert new cell above or below current cell). Switch cell type to
‘Markdown’, either via dropdown in toolbar or via ESC, then M.
Task: Write some Markdown code to the cell and execute the cell. To modify Markdown code double click the cell.
Then edit and execute again.
22.1.5 Terminals
JupyterLab allows you to run one or more terminals. A terminal is a text interface to the computer's operating system.
There you can use operating system features and programs not accessible through JupyterLab. Examples are copying
files or deleting non-empty directories.
Note: Almost all remote JupyterLab instances run on Linux machines, so in a terminal you have to use Linux
commands, not Windows commands. Linux, macOS, OpenBSD and many other operating systems share a common
set of commands. Only Windows has its own set.
Task: Open a terminal (click the Terminal button in the launcher tab’s Other section). Then type ls to get a list of
all files in the current directory. Then type logout to close the terminal.
The pwd command prints the current directory’s path. Commands for working with files are cd (change directory),
cp (copy), mv (move, rename), rm (remove), mkdir (create directory), rmdir (remove directory). See Unix
Commands291 for more commands and usage information.
291 https://en.wikibooks.org/wiki/Guide_to_Unix/Commands
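A few of these commands in action (the directory name is arbitrary):

```shell
mkdir demo_dir        # create a directory
cd demo_dir
pwd                   # print the current directory's path
cd ..
rmdir demo_dir        # remove the now-empty directory again
ls                    # list files in the current directory
```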
Hint: If you close a terminal tab without typing logout the terminal remains active in the background. To reopen
it or to shut it down click the ‘Running Terminals and Kernels’ button in the sidebar.
To quit JupyterLab click 'File' and 'Log Out'. Closing the browser tab without logging out may cause security
problems.
Before leaving JupyterLab shut down all kernels and terminals. This saves resources on the server. Most Jupyter
providers (including Gauss at Zwickau University) will shut down inactive kernels and JupyterLab sessions after
some hours automatically.
Note: It's possible to log out from JupyterLab while keeping a kernel running. Coming back some hours or days
later, one can fetch the outputs of long-running tasks. The corresponding workflow is described in the Long-Running
Tasks (page 295) project.
We want to set up an extensible system for Python development and data science in general, including Jupyter as one
component. Here we only install the base system. From time to time tools and Python libraries can be added on
demand.
Hint: A Python library is a collection of Python code files extending Python’s set of commands.
22.2.1 Conda
Task: Go to Miniconda Installer List295 and download a suitable installer for your system. Then follow the install
instructions296 for your system.
Conda is a command line tool. If you feel more comfortable with GUI tools, install Anaconda Navigator297 .
Task: Open a terminal and run conda install anaconda-navigator in it. This installs Anaconda
Navigator.
Depending on your system there should now be an entry for Anaconda Navigator in your system's app menu. If not,
add an entry manually. The Anaconda Navigator executable should be in the bin subdirectory of Miniconda's
installation directory. On non-Windows systems run which anaconda-navigator in a terminal to get the path.
Important: Before you install any additional tools with Anaconda Navigator or Conda, read on!
At the moment there is only one Python environment on your system, called base. Don’t install additional packages
to this environment. Create a separate environment for each kind of task, for instance, an environment you use to
work through projects and exercises in this book.
Task: Create a new Python environment ds-book. Either run conda create -n ds-book in a terminal or
go to ‘Environments’ page in Anaconda Navigator. Then click the plus button and follow the GUI instructions.
To switch between environments use Anaconda Navigator or run conda activate environment_name in
a terminal.
Note: Although you selected only one package for install, many more will be installed due to dependencies.
JupyterLab requires a number of other packages and those packages may require others again. Conda manages such
dependencies for us.
The Home page of Anaconda Navigator shows a launch button for JupyterLab. Make sure you have selected the
correct environment in the dropdown above the launch buttons.
Task: Launch JupyterLab via Anaconda Navigator or from the command line: jupyter lab or jupyter-lab.
Freshly installed JupyterLab lives in the environment you installed it in. Creating a new Python environment requires
installing JupyterLab in that environment, too (if you want to use JupyterLab there).
By default, JupyterLab shows your home directory and disallows visiting directories outside your home directory.
To access a different directory, run JupyterLab from a terminal. JupyterLab will then show the directory that was
active in the terminal at launch.
295 https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links
296 https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation
297 https://docs.anaconda.com/anaconda/navigator/
298 https://xkcd.com/1987
Fig. 22.4: The Python environmental protection agency wants to seal it in a cement chamber, with pictorial messages
to future civilizations warning them about the danger of using sudo to install random Python packages. Source:
Randall Munroe, xkcd.com/1987298
Fig. 22.5: The filter dropdown provides several filters for the package list.
There's already a basic Python installation in your environment, so you can use Python in JupyterLab. Additional
packages (math, visualization,…) can be installed on demand in the same way we installed JupyterLab. Always keep
an eye on the environment name when installing, so things end up in the correct environment.
Hint: Next to Conda there exist other package managers. A very prominent one is Pip299 . Conda automatically
installs Pip in each environment. To install a package with Pip simply write pip install package_name in
a terminal. Conda will take care of packages installed with Pip, too.
Some packages are only available for install with Conda, others only with Pip. So both package managers have to
be used in parallel.
Recently, a JupyterLab Desktop App300 has been released. This brings the look and feel of usual GUI apps to
JupyterLab’s start-up process. After start-up there’s no difference to browser based JupyterLab.
Handling of different Python environments is somewhat more difficult than with plain JupyterLab.
Jupyter is only one of many Python IDEs (integrated development environments). Although Jupyter is well suited
for data science applications, because text, visualizations and code can be mixed in one and the same document,
other tools may be appropriate, too.
Python (better: the Python interpreter) is a stand-alone program for running Python code on the command line. It
has an interactive mode, which executes each line of code immediately after writing. Alternatively, we may provide
a file containing Python code and the Python interpreter executes the file’s content.
Task: Open a terminal and type python (don’t forget to activate the correct environment with Conda or Anaconda
Navigator).
Now the Python interpreter runs in interactive mode. All Python commands are allowed. It’s, for instance, a powerful
replacement for a calculator.
Task: Type the following line by line
1 + 2 * 3
a = 2
b = 3
(a + b) ** 3
The result of each line is printed on screen immediately after hitting the return key.
To quit the interpreter we have to call the Python function for leaving a program.
Task: Type exit().
299 https://pypi.org/project/pip/
300 https://github.com/jupyterlab/jupyterlab-desktop
We may write Python code to a text file and hand it over to the Python interpreter for execution.
There are lots of text editors with additional features for coding, like syntax highlighting, automatic indentation, line
numbers. Two common ones are Kate301 (Linux, MacOS, Windows) and Notepad++302 (Windows). Others you
may hear about are Emacs303 , Vim304 and Nano305 , especially if you work on non-Windows machines (cloud!).
Fig. 22.6: Real programmers set the universal constants at the start such that the universe evolves to contain the disk
with the data they want. Source: Randall Munroe, xkcd.com/378306
code = None
if code == "":
    print("Too lazy to type?")
print("")
print("Bye")
Indentation matters! Python uses indentation (white space) to structure code. So don't modify the indentation depth
here.
Task: Open a terminal, go to your code file’s directory and run python bye.py.
The Python interpreter now runs your program. When reaching the end of the file, execution stops. Only output from
your program is shown. The Python interpreter itself doesn’t print anything as long as there are no errors in your
program.
301 https://kate-editor.org/
302 https://notepad-plus-plus.org
303 https://www.gnu.org/software/emacs/
304 https://www.vim.org/
305 https://www.nano-editor.org/
306 https://xkcd.com/378
22.3.3 Spyder
Spyder307 is a Python IDE for scientific programming with look and feel similar to Octave and Matlab. To use Spyder
install the spyder package with Anaconda Navigator or via conda install spyder in a terminal.
In addition to a text editor, an interactive Python interpreter and a separate plotting area, Spyder provides tools for
debugging (finding errors in programs) and for code profiling (measuring execution time and memory consumption).
Task: Run Spyder (click button in Anaconda Navigator or type spyder in terminal).
Task: Run the Spyder tour (Help > Show Tour).
Task: Open bye.py in Spyder. Run the program by clicking the ‘run file’ button in the toolbar (triangle symbol).
Fig. 22.7: Spyder after running bye.py. The variable inspector shows that now there is a variable code set in the
interactive interpreter.
22.3.4 Executables
Python code can be bundled together with the interpreter and all required libraries into one executable file. This is
not recommended in general, because the file will be relatively large, but it’s the only way to provide Python programs
to people who do not have a Python interpreter installed on their machine.
There are several tools for this job. A popular one is PyInstaller, which is available for the Anaconda distribution.
To convert your Python source code file into an executable file, open a terminal and type
Hint: Install the pyinstaller package via Anaconda Navigator or via conda install pyinstaller in
a terminal.
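A typical PyInstaller invocation for the bye.py file from above looks like this (a sketch; the --onefile option bundles everything into a single file, see pyinstaller --help for details):

```shell
pyinstaller --onefile bye.py
```

By default PyInstaller writes the resulting executable to the dist subdirectory.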
307 https://www.spyder-ide.org/
to be continued…
TWENTYTHREE
PYTHON PROGRAMMING
Programming is like riding a bike. If you want to learn it, you have to do it. Although nowadays there’s code available
for almost every standard task, implementing simple algorithms like searching and sorting ourselves is very instructive.
• Simple List Algorithms (page 297)
• Geometric Objects (page 300)
• Vector Multiplication (page 301)
In this project we implement simple algorithms related to lists, like sorting a list or finding special values. The purpose
of the project is threefold:
• familiarize yourself with Python’s syntax,
• learn to algorithmize, that is, how to combine available building blocks to solve a task,
• see and understand how basic algorithms frequently used in data science work.
Before you work through the project you should have read Building Blocks (page 46). Restrict yourself to Python
features discussed there. Don’t use ready-made library functions.
Important: Don’t use list as name for a variable holding some list, although this would be quite expressive.
Several names like print and int and list are already occupied by Python. Python won’t complain about
reusing some of its predefined names as variables, but it’s considered bad practice.
# your solution
Data Science and Artificial Intelligence for Undergraduates
4. Test your code with several different sample lists. Include pathological cases like [1, 1, 1] and [1].
Solution:
# your solution
Task: What happens if you test your code with an empty list? Now add some code to your function to check whether
the list is empty. If the list is empty get_max should print a message and return 0.
Solution:
# your solution
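For comparison, one possible sketch of get_max with the required empty-list handling (the exact message text is up to you; try your own version first):

```python
def get_max(values):
    # empty list: print a message and return 0, as required above
    if len(values) == 0:
        print("List is empty, returning 0.")
        return 0
    maximum = values[0]
    for value in values:
        if value > maximum:
            maximum = value
    return maximum
```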
# your solution
# your solution
Task: What happens if you test your code with an empty list? Now add some code to your function to check whether
the list is empty. If the list is empty get_mean should print a message and return 0.
Solution:
# your solution
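One possible sketch of get_mean with the required empty-list handling (compare with your own attempt):

```python
def get_mean(values):
    # empty list: print a message and return 0, as required above
    if len(values) == 0:
        print("List is empty, returning 0.")
        return 0
    total = 0
    for value in values:
        total = total + value
    return total / len(values)
```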
Given a list of integers we want to count how often a given integer occurs in the list.
Task: Write a function count_value taking two arguments (the list and an integer) and returning the number of
occurrences of the integer in the list. Proceed step by step as before. How to handle empty lists here?
Solution:
# your solution
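One possible sketch of count_value; note that an empty list needs no special treatment here, because the loop body simply never runs and 0 is returned:

```python
def count_value(values, number):
    # count how often number occurs in values
    count = 0
    for value in values:
        if value == number:
            count = count + 1
    return count
```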
There exist plenty of algorithms for sorting308 values in lists. Here we consider selection sort309 for sorting a list of
integers.
Fig. 23.1: Selection sort divides the list into two parts: sorted items and unsorted items. It repeatedly walks (blue)
through the unsorted items to find the smallest (red) unsorted item. Then it swaps the first unsorted item with the small-
est unsorted item. The swapped item then belongs to the sorted part (yellow). Source: Joestape89, wikipedia.org310,
CC BY-SA 3.0311, modified by the author.
Task: Write a function sort taking a list and returning the sorted list. Proceed step by step as before. Don’t forget
to extensively test your code!
Solution:
# your solution
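For reference, one possible selection sort sketch following the figure above (sorting in place and returning the list; write your own version before peeking):

```python
def sort(values):
    # selection sort: grow the sorted part from the left by repeatedly
    # swapping the smallest unsorted item to the front of the unsorted part
    for first_unsorted in range(len(values)):
        smallest = first_unsorted
        for index in range(first_unsorted + 1, len(values)):
            if values[index] < values[smallest]:
                smallest = index
        values[first_unsorted], values[smallest] = values[smallest], values[first_unsorted]
    return values
```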
308 https://en.wikipedia.org/wiki/Sorting_algorithm
309 https://en.wikipedia.org/wiki/Selection_sort
310 https://en.wikipedia.org/wiki/File:Selection-Sort-Animation.gif
311 https://creativecommons.org/licenses/by-sa/3.0/deed.en
To get a better feeling for the object-oriented approach to programming we implement classes for geometric objects,
like roughly sketched in Everything is an Object (page 59). How to draw lines can be guessed from Library Code
(page 57). Generally, you should have completed the Crash Course (page 43) chapter before starting with this project.
We’ll need Matplotlib for this project. So we should import it. In principle, imports can be placed anywhere in the
code, but it’s considered good practice to have them in the first lines of a file.
23.2.1 Points
Task: Create a class Point with two member variables x and y. To create a Point object we want to call
Point(3, 4), for instance.
Solution:
# your solution
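A minimal sketch of such a class:

```python
class Point:
    # a point in the plane with coordinates x and y
    def __init__(self, x, y):
        self.x = x
        self.y = y
```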
23.2.2 Triangles
Task: Create a class Triangle. For creating a Triangle object we want to pass three Point objects.
Solution:
# your solution
Task: Add a draw member function to Triangle which draws the triangle on screen using Matplotlib. Keep the
whole class definition in one code cell, because class definitions cannot be scattered over multiple cells.
Task: Create four points and draw two triangles between them (two points are used by both triangles). Remember
to call plt.show() to show the plot on screen. In Jupyter the plot shows up even without plt.show(), due to
some Jupyter magic, but in plain Python you won’t see the plot.
Solution:
# your solution
23.2.3 Rectangles
Task: Create a class Rectangle representing a paraxial (axis-parallel) rectangle. For creating a Rectangle object we want to
pass two Point objects, the lower left corner and the upper right corner. Each of the rectangle’s four points should
be accessible as a member variable.
Solution:
# your solution
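One possible sketch (the two remaining corners are derived from the given ones, so all four are member variables; the Point class from above is repeated to keep the example self-contained):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Rectangle:
    # lower_left and upper_right are Point objects;
    # the other two corners are derived from them
    def __init__(self, lower_left, upper_right):
        self.lower_left = lower_left
        self.upper_right = upper_right
        self.lower_right = Point(upper_right.x, lower_left.y)
        self.upper_left = Point(lower_left.x, upper_right.y)
```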
Task: Add a draw member function to Rectangle (use same code cell as before).
Task: Draw a series of 20 squares centered at the origin and growing from edge length 2 to 40. Call plt.
axis('equal') before plt.show() to make the squares look like squares.
Solution:
# your solution
23.2.4 Houses
Task: Create a class House representing a house made of a rectangular body and a triangular roof. The roof has
one third of the body’s height and is 20 per cent wider than the body. For creation we want to provide the lower left
corner as well as width of the body and total height of the house. Add a draw method.
Solution:
# your solution
# your solution
Task: Add a move method to Point which takes two numbers and moves the point paraxially by the specified
amounts. Then add move methods to Triangle, Rectangle and House, which call Point’s move method.
Task: Write a function row_of_houses for drawing a row of identical houses. Arguments are width and height
of the houses as well as start and end coordinates of the row on the x axis. Create one House object inside the
function. Draw and move the house in a while loop until the house leaves the specified interval.
Solution:
# your solution
# your solution
The aim of this project is to develop a class (object type) for 3-dimensional vectors. The class shall provide typical
calculations with vectors as Python operators. Before working through this project you should have read the Variables
and Operators (page 75) chapter. For a recap of vector operations see Vectors (page 333).
This project consolidates your knowledge on defining custom object types and demonstrates the possibilities of
Python’s flexible approach to customization of operator’s behavior.
We start with a minimal working example and then add more functionality step by step.
Task: Create a class Vector representing a vector in 3-dimensional space. The __init__ method takes the 3
components as arguments.
Solution:
# your solution
Task: Add dunder functions __str__ and __repr__ returning human readable representations of a Vector
object. Test your code by creating and printing a Vector object.
Solution:
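One possible sketch of the class so far, with __str__ and __repr__ (the exact text of the representation is up to you):

```python
class Vector:
    # a vector in 3-dimensional space
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __str__(self):
        # human readable representation used by print and str
        return "Vector({}, {}, {})".format(self.x, self.y, self.z)

    def __repr__(self):
        # used in interactive sessions and inside containers
        return self.__str__()
```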
23.3.2 Addition
When implementing operations on custom objects we have to decide whether to modify an existing object or to create
a new object holding the operation’s result. For vectors, a + b shouldn’t modify a, but return a new object.
Task: Implement vector addition with + operator. The operation should return a new object holding the result. If
the second operand is not of type Vector, addition should return NotImplemented. At this point we do not
need __radd__ because we only accept Vector objects for addition. So the left-hand side operand will always
be a Vector object and Python will call its __add__ method. Test your code!
Solution:
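A sketch of the addition part (only the members relevant here are shown; a new object is returned, the operands remain unchanged):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __add__(self, other):
        # only Vector + Vector is supported at this point
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)
```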
Task: Modify your implementation of addition to accept lists of length 3, too. Don’t forget to implement __radd__
now. Else list plus vector won’t work, because the left-hand side list object doesn’t know how to work with Vector
objects. Thus, Python tries to call __radd__ on the right-hand side operand. As always: test your code!
Solution:
Task: Implement multiplication of vectors by scalars via *. Both variants, scalar times vector and vector times scalar,
should be supported.
Solution:
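A sketch of scalar multiplication (again only the relevant members are shown; reusing __mul__ as __rmul__ covers scalar times vector):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __mul__(self, scalar):
        # vector * scalar
        if not isinstance(scalar, (int, float)):
            return NotImplemented
        return Vector(scalar * self.x, scalar * self.y, scalar * self.z)

    # scalar * vector: Python calls __rmul__ on the right-hand operand
    __rmul__ = __mul__
```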
Now we run out of operators. Of course, we could use * again for inner products and check types to decide whether we
have to do multiplication by scalars or to compute an inner product. The more readable alternative is implementing
inner products without an operator.
Task: Implement a method inner taking a Vector or a list object as argument and returning the inner product
with the vector whose inner method has been called. If the argument is neither a vector nor a 3-element list, return
NotImplemented, although the Python interpreter does not care here (because inner is a regular method, not an
operator). If you know about Python’s exception handling mechanism, you should raise NotImplementedError
here.
Solution:
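A sketch of the inner method (lists of length 3 are converted to a Vector first, everything else yields NotImplemented as described above):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def inner(self, other):
        # accept a 3-element list by converting it to a Vector first
        if isinstance(other, list) and len(other) == 3:
            other = Vector(other[0], other[1], other[2])
        if not isinstance(other, Vector):
            return NotImplemented
        return self.x * other.x + self.y * other.y + self.z * other.z
```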
Task: In analogy to inner implement a method outer returning the outer product of two vectors.
Solution:
23.3.6 Equality
At the moment == behaves like is. But we want to make == compare vectors componentwise.
Task: Implement equality test via == between two Vector objects and a vector and a list. If the second operand is
of incorrect type both operands are considered unequal.
Solution:
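A sketch of componentwise equality (note that list == vector also works, because the list returns NotImplemented and Python then tries the Vector’s __eq__):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __eq__(self, other):
        # componentwise comparison; 3-element lists are allowed, too
        if isinstance(other, list) and len(other) == 3:
            other = Vector(other[0], other[1], other[2])
        if not isinstance(other, Vector):
            return False
        return self.x == other.x and self.y == other.y and self.z == other.z
```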
There’s a dunder method __getitem__ which is called by the Python interpreter whenever indexing syntax [...]
is applied to an object. The index is passed to the method and the method is expected to return the corresponding
item.
Task: Implement __getitem__. For invalid indices return None. If you know about Python’s exception handling
mechanism, you should raise IndexError here.
Solution:
Task: Make the len function work on Vector objects. Return value should be 3.
Solution:
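A sketch covering both tasks, __getitem__ (returning None for invalid indices, as specified above) and __len__ (which makes the len function work):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __getitem__(self, index):
        # valid indices are 0, 1, 2; None for everything else
        if index == 0:
            return self.x
        if index == 1:
            return self.y
        if index == 2:
            return self.z
        return None

    def __len__(self):
        # len(v) calls v.__len__()
        return 3
```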
TWENTYFOUR
WEATHER
We will work through several projects related to weather data and forecasting. Data will be obtained from Deutscher
Wetterdienst312 .
• DWD Open Data Portal (page 305)
• Getting Forecasts (page 307)
• Climate Change (page 309)
Deutscher Wetterdienst (DWD)313 is Germany’s public authority for collecting, managing and publishing weather
data from around the world. DWD also creates weather forecasts for Germany and all other regions of the world.
Some years ago DWD launched an Open Data Portal314 and continually extends its services there.
DWD’s open data portal provides lots of data and is very complex. In this project we explore part of its structure and
locate data sources for subsequent projects.
24.1.1 Licensing
We want to use DWD’s data for education and research. Before we delve into the data we should check whether we
are allowed to do this and whether and how to attribute the source.
Task: Find out whether we are allowed to use DWD’s data for education and research purposes.
Solution:
# your answer
# your answer
312 https://www.dwd.de
313 https://www.dwd.de
314 https://www.dwd.de/opendata
There are lots of weather stations in Germany collecting weather data. To locate weather data on the map, we need
a list of stations and their geolocations.
Task: Find a list of all DWD weather stations in Germany measuring air temperature at least once per hour. The list
should contain geolocations (longitude, latitude, altitude) and other parameters. Get the URL, so we can download
it on demand.
Solution:
# your answer
Task: How many weather stations do we have in the list? List all parameters available for the stations.
Solution:
# your answer
Task: Get the station list for hourly precipitation measurements. How many stations do we have here?
Solution:
# your answer
24.1.3 Data
For each weather station we want to have access to all its historical and most recent measurements.
Task: Locate hourly temperature measurements for station Lichtentanne315. What’s the most recent measurement
(timestamp and temperature)? What’s the oldest measurement available?
Solution:
# your answer
24.1.4 Metadata
Each station comes with extensive metadata telling a story about the station and its measurements.
Task: Answer the following questions from metadata of Lichtentanne station:
• Did the station move? If yes, when? Was its name changed, too?
• Has measurement equipment been replaced? If yes, when?
• What’s the time zone of timestamps in the data?
Solution:
# your answer
The Open Data Portal316 of Deutscher Wetterdienst (DWD)317 provides detailed forecasts for Germany and all other
regions of the world in human and machine readable form. The machine readable service is called MOSMIX318 . In
this project we
• collect information on how to use MOSMIX,
• automatically download newly published MOSMIX data,
• convert MOSMIX files to CSV files.
In this project we heavily rely on techniques presented in Accessing Data (page 111).
DWD’s open data portal is quite complex. Before we start downloading forecast data we have to find information
on data location and format.
Task: Read about MOSMIX at DWD’s MOSMIX info page319 . Follow relevant links and answer the following
questions:
• What are the differences between MOSMIX S and MOSMIX L?
• What’s the URL of the most recent MOSMIX L file for station ‘Zwickau’?
• What standard file formats are used for MOSMIX files (KMZ files)?
• How long are MOSMIX files available at DWD’s open data portal?
Solution:
# your answers
MOSMIX data older than two days gets removed from DWD’s open data portal. To be able to analyze the quality of
forecasts (that is, to compare them to real observations) we have to keep them in a local archive. For this purpose we
would have to visit DWD’s open data portal once a day and look for new MOSMIX files. Then we could download
them and add them to our local archive. With Python we may automate this job.
Task: Write a function get_available_mosmix_files which scrapes a list of URLs of all currently available
MOSMIX L files for a selected station from DWD open data portal. Arguments:
• station ID (string).
Return value:
• URLs (list of strings).
Solution:
# your solution
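A sketch of get_available_mosmix_files; the base URL and the file name pattern MOSMIX_L_<runtime>_<station>.kmz are assumptions to be verified against the portal’s actual directory listing:

```python
import re
import urllib.request

# assumed base URL of the single-station MOSMIX L listings;
# check this against the portal's actual directory structure
BASE_URL = ("https://opendata.dwd.de/weather/local_forecasts/mos/"
            "MOSMIX_L/single_stations/")


def parse_mosmix_listing(html, station_id):
    # extract file names like MOSMIX_L_2024010109_10577.kmz for one station
    pattern = r'href="(MOSMIX_L_\d+_' + re.escape(station_id) + r'\.kmz)"'
    return re.findall(pattern, html)


def get_available_mosmix_files(station_id):
    # download the directory listing and return full URLs (list of strings)
    url = BASE_URL + station_id + "/kml/"
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    return [url + name for name in parse_mosmix_listing(html, station_id)]
```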
Now it’s time to download the files. Maybe we already downloaded some of them yesterday. So we should have a
look in our archive directory first to avoid downloading more files than necessary.
Task: Write a function download_files which downloads all new files from a list of URLs. Arguments:
316 https://www.dwd.de/opendata
317 https://www.dwd.de
318 https://www.dwd.de/EN/ourservices/met_application_mosmix/met_application_mosmix.html
319 https://www.dwd.de/EN/ourservices/met_application_mosmix/met_application_mosmix.html
# your solution
Now that we have MOSMIX files in our local storage we should convert them to CSV files. Each row shall contain
all weather parameters for a fixed point in time. The first column is the time stamp; all other columns contain the
weather parameters from the MOSMIX files.
Task: Write a function kmz_to_csv for converting a list of KMZ files to CSV files. Arguments:
• archive path (string),
• list of file names (list of strings).
No return value.
Hint: MOSMIX files use an XML feature known as namespaces. Consequently, tag names contain colons, which
confuses Beautiful Soup’s standard HTML parser (which also parses simple XML files). To get MOSMIX files parsed
correctly, install the lxml module and provide a second argument 'xml' to Beautiful Soup’s constructor. This tells
Beautiful Soup to use a dedicated XML parser, which by default is lxml.
Solution:
# your solution
To collect forecasts over a longer period of time we have to run the developed code once per day. We could implement
a loop and use time.sleep to make Python wait one day before continuing with the next run. The better (simpler
and more efficient) solution is to tell the operating system to run the Python program each day at a fixed time.
On Linux and macOS there is cron (and anacron) for scheduling tasks. On Windows there is the Task Scheduler.
Task: Find out the details about scheduling a daily task on your system. Then make a Python script file from your
code above and let it run once per day.
Solution:
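On Linux, a crontab entry (edit with crontab -e) along the following lines runs a script every day at 6:00; the interpreter and script paths are placeholders and must be adapted to your system:

```
0 6 * * * /usr/bin/python3 /home/user/mosmix_download.py
```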
320 https://docs.python.org/3/library/os.path.html#os.path.isfile
In this project we download historic weather data from DWD Open Data Portal321 and have a look at annual mean
temperatures and other values at different locations in Germany.
In this project we heavily rely on techniques presented in Accessing Data (page 111) and High-Level Data Management
with Pandas (page 187) as well as on knowledge obtained in the DWD Open Data Portal (page 305) project.
We use the DWD data set Historical daily station observations for Germany322 , see description323 .
# your solution
Task: Get a list of file names of all ZIP files of the data set.
Hint: A good idea is to construct file names from data in the station list (ID, first and last day of measurement). But
it turns out that the dates in the station list and in the file names do not coincide for several files. Thus, we have to
scrape the file names from the data set’s file listing326.
Solution:
# your solution
climate_daily_kl_historical_en.pdf
324 https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/KL_Tageswerte_Beschreibung_
Stationen.txt
325 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html
326 https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/
4. Write data to a CSV file (one large CSV file for data from all stations).
Columns for CSV file:
• date (timestamp of measurement),
• id (station ID, integer),
• 'wind_gust', 'wind_speed', 'precipitation', 'sunshine', 'snow',
'clouds', 'pressure', 'temperature', 'humidity', 'max_temp', 'min_temp',
'min_temp_ground' (float).
Solution:
# your solution
Dates of first and last measurements are incorrect in the station list created above. Now that we have the measurements,
we should correct the list.
Task: For each station get dates for first and last measurement and write them to the station list CSV file. Drop all
stations that do not have any measurements.
# your solution
24.3.4 Plots
# your solution
TWENTYFIVE
A major application of data science and artificial intelligence is recognition of handwritten characters. In a series of
projects we will implement different techniques for this task based on the famous MNIST data set (and related data
sets) for training recognition systems. MNIST is provided by the National Institute of Standards and Technology327.
• The xMNIST Family of Data Sets (page 311)
• Load QMNIST (page 313)
In this project we have a first look at the MNIST data set and related data sets. In subsequent projects we’ll use these
data sets for training machine learning models.
A major benefit of the project is that we see how difficult data preparation can be. As we’ll learn later on, obtaining
unbiased data is extremely important for training machine learning algorithms.
# your answers
327 https://www.nist.gov
328 https://www.nist.gov/system/files/documents/srd/nistsd19.pdf
329 https://www.nist.gov/srd/nist-special-database-19
25.1.2 MNIST
# your answers
25.1.3 QMNIST
# your answers
25.1.4 EMNIST
# your answers
In this project we develop a Python module for loading and preprocessing QMNIST images and metadata. Prerequisites:
• Efficient Computations with NumPy (page 155)
• The xMNIST Family of Data Sets (page 311)
Task: Get QMNIST training and test data from QMNIST GitHub repository336 (4 files ending with ...
idx3-ubyte.gz or ...idx2-int.gz) and find information on the file format.
Task: Write a function load which reads images and metadata from the QMNIST files. Parameters:
• path: defaulting to '', path of directory with data files.
• subset: defaulting to 'train' (load training data), passing 'test' loads test data.
• as_list: defaulting to False (return one large array), passing True returns a list of images.
Return values:
• NumPy array of shape (60000, 28, 28) or list of 60000 NumPy arrays of shape (28, 28) (range
0…1, type float16), depending on parameter as_list.
• NumPy array of shape (60000, ) containing classes (type uint8).
• NumPy array of shape (60000, ) containing series IDs (type uint8).
• NumPy array of shape (60000, ) containing writer IDs (type uint16).
334 https://arxiv.org/pdf/1702.05373.pdf
335 https://www.nist.gov/itl/products-and-services/emnist-dataset
336 https://github.com/facebookresearch/qmnist
Test your function and show first and last images of training and test data. Print corresponding metainformation. You
may use the code from Image Processing with NumPy (page 271) to show images.
Hint: Going the obvious path via zipfile module and np.fromfile fails due to two problems:
1. Python’s zipfile module has some trouble reading the QMNIST files. Try the gzip module337 from
Python’s standard library instead.
2. NumPy’s fromfile is not compatible with file objects created by the gzip module. The fromfile
function will read compressed instead of uncompressed data (for some very knotty technical reasons). Thus,
read with the file object’s read method and use np.frombuffer.
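The hint can be sketched as follows; the header layout (four big-endian uint32 values: magic number, image count, rows, columns) is the documented IDX3 format, and the scaling to range 0…1 float16 matches the specification above:

```python
import gzip

import numpy as np


def read_idx3_images(path):
    # gzip.open instead of zipfile, read() plus np.frombuffer
    # instead of np.fromfile, as discussed in the hint
    with gzip.open(path, "rb") as f:
        data = f.read()
    # header: magic number, image count, rows, columns (big-endian uint32)
    magic, count, rows, cols = np.frombuffer(data[:16], dtype=">u4")
    images = np.frombuffer(data[16:], dtype=np.uint8)
    images = images.reshape(count, rows, cols)
    # scale pixel values to range 0...1 and convert to float16
    return images.astype(np.float16) / 255
```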
Solution:
# your solution
25.2.2 Preprocessing
Before images can be used preprocessing steps might be appropriate. Given a list of preprocessing steps we would
like to have a function which applies all the steps to all images.
Task: Write a function preprocess which applies a list of preprocessing steps to all images. Parameters:
• images: large NumPy array or list of arrays (images to be processed).
• steps: list of functions; each function takes an image and returns an image.
• as_list: False (default) returns images in large array (and fails if image sizes differ after applying pre-
processing steps); True returns list of images.
Return values:
• list of processed images or large array of images, depending on parameter as_list.
Test your code with two preprocessing steps:
1. horizontal mirroring,
2. color inversion (black to white, white to black).
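One possible sketch including the two test steps (np.stack is one way to rebuild the large array; it raises an error for differing shapes, which matches the required behavior):

```python
import numpy as np


def preprocess(images, steps, as_list=False):
    # apply every preprocessing step to every image, in the given order
    processed = []
    for image in images:
        for step in steps:
            image = step(image)
        processed.append(image)
    if as_list:
        return processed
    # fails if image sizes differ after preprocessing, as intended
    return np.stack(processed)


# the two test steps: horizontal mirroring and color inversion
def mirror(image):
    return image[:, ::-1]


def invert(image):
    return 1.0 - image
```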
Solution:
# your solution
Task: Create a Python module qmnist.py providing both functions load and preprocess.
Solution:
# your solution
337 https://docs.python.org/3/library/gzip.html
TWENTYSIX
CAFETERIA
Have a look at the Zwickau and Chemnitz Universities’ menu338 (the cafeterias of both universities are operated by
Studentenwerk Chemnitz-Zwickau339). In this project we want to scrape as much historic menu data as possible
from that website. Read Accessing Data (page 111) before you start. Section Web Access (page 122) is of particular
importance.
Often web APIs come with some documentation. In our case we neither see an obvious API nor any documenta-
tion. Clicking through the menus of past weeks and watching the browser’s address bar we see how date and other
information is encoded in the URL. This is our key for scraping historic data.
In addition, there is a link on the lower right looking like information about the API. It turns out that there is not
much API-related information, but a useful hint on an XML interface340 using the same parameter encoding as
the HTML interface.
Task: Understand the arguments in the HTML URLs. Then try the XML API from your browser’s address bar.
Note all location IDs (for ‘Mensa Ring’ and so on) and the oldest available menu (by trial and error).
Solution:
# your answer
Have a look at the license information341. There we read that it’s okay to use the data for our intended purposes.
Remember not to fire too many requests at the server in a short time! This may trigger some protection mechanism
making the server refuse any communication with us.
• Limit the number of requests per second by pausing your script after each request.
• While developing and testing automatic download, limit the total number of requests to a handful until you’re
certain that your script works correctly.
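Both rules can be wrapped in a small helper function; fetch is a hypothetical placeholder for the actual download function (built on urllib or requests, for instance) and the default values are assumptions:

```python
import time


def polite_download(urls, fetch, delay=1.0, max_requests=None):
    # fetch: function taking a URL and returning the downloaded content;
    # delay: pause in seconds after each request;
    # max_requests: cap for the testing phase (None means no cap)
    results = {}
    for number, url in enumerate(urls):
        if max_requests is not None and number >= max_requests:
            break
        results[url] = fetch(url)
        time.sleep(delay)
    return results
```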
338 https://www.swcz.de/bilderspeiseplan
339 https://www.swcz.de
340 https://www.swcz.de/bilderspeiseplan/xml.php
341 https://www.swcz.de/bilderspeiseplan/lizenz.php
# your solution
26.4 Parsing
Task: From all the downloaded files extract all meals including date, category, description, and prices for students,
staff, guests. Save the data to a CSV file.
Solution:
# your solution
TWENTYSEVEN
PUBLIC TRANSPORT
In this series of projects we visualize and analyze public transport networks based on open data.
• Get Data and Set Up the Environment (page 317)
• Find Connections (page 322)
In this project we download public transport data and install several Python packages for its processing. Some basic
knowledge in Python programming is required for this project.
Timetable data for public transport operators in Germany is available in GTFS format342 .
Task: Go to gtfs.de343 . Find available GTFS feeds. What types of transport are contained in each feed? What time
periods are covered by the data? Are we allowed to use the data?
Solution:
# your answers
Task: Download all available data from gtfs.de344 . Note download URLs and terminal commands (if you use the
terminal).
Solution:
# your notes
342 https://en.wikipedia.org/wiki/GTFS
343 https://www.gtfs.de
344 https://www.gtfs.de
To compute walking distances between neighboring public transport stops we’ll use data from OpenStreetMap
(OSM)345. The OSM website provides downloads of (too) small regions or of the whole planet (about 60 GB).
Geofabrik GmbH346 provides regional downloads.
Task: Check OSM licence information. Then download OSM data for Europe in PBF format (Germany is not
enough, because GTFS data may contain stops in neighboring countries, if German trains cross borders). Note the
download URL and terminal commands.
Solution:
# your notes
Extracting walking distances from OSM data requires a lot of memory. Memory consumption grows with size of the
region under consideration. Thus, we should extract our region of interest from Europe’s OSM file.
Task: Find minimum and maximum latitude and longitude of your region of interest (go to OSM and look at the
coordinates of some object on the border of your region of interest).
Solution:
# your answer
There exist many tools for processing OSM data. A very handy one is Osmosis347. You may use it as a Python package
or in the terminal. The terminal command for data extraction is
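a command of roughly the following shape (a sketch; file names and the bounding-box coordinates of your region of interest are placeholders):

```
osmosis --read-pbf europe-latest.osm.pbf \
        --bounding-box top=51.0 left=12.0 bottom=50.0 right=13.0 \
        --write-pbf region.osm.pbf
```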
Task: Extract your region of interest with Osmosis. Note the full terminal command.
Solution:
# your notes
We want to use the gtfspy348 Python package. It has been unmaintained since 2019 (at least); thus, installation is tricky
due to outdated dependencies. But it’s a nice package including fast public transport routing. It has been developed for
creating A collection of public transport network data sets for 25 cities349 (also see the corresponding GitHub repo350).
To avoid messing up your everyday Conda environment with failed installations and broken dependencies create a
new Conda environment for this project.
Task: Create a new Conda environment gtfs. If working on Gauss351 , don’t forget to create a corresponding
ipykernel for Jupyter and to switch your notebook’s kernel to the new one.
Solution:
345 https://www.osm.org
346 http://www.geofabrik.de/
347 https://wiki.openstreetmap.org/wiki/Osmosis
348 https://github.com/CxAalto/gtfspy
349 https://www.nature.com/articles/sdata201889
350 https://github.com/CxAalto/gtfs_data_pipeline
351 https://gauss.fh-zwickau.de
# your notes
The gtfspy package depends on the osmread352 package. But osmread isn’t available via Conda. Via PyPI (that
is, pip) we get an older version with outdated (unsatisfiable) dependencies. Thus, we have to install osmread from
source.
Task: Find out what the following commands do. For each line write a short comment. Then run the commands
(works on Linux, macOS and Co.; for Windows minor modifications may be required).
Solution:
# your notes
The gtfspy package comes with outdated dependencies and several programming errors. Thus, we install it from
source as a local package in our working directory. This way we may easily fix issues when they pop up.
Task: Find out what the following commands do. Why do we need the mv commands? For each line write a short
comment. Then run the commands (works on Linux, macOS and Co.; for Windows minor modifications may be
required).
pip install pandas networkx pyshp nose Cython shapely pyproj mopy geoindex geojson matplotlib-scalebar
Solution:
# your notes
The gtfspy package uses several outdated library functions (mainly from networkx package) and contains some
programming errors. Some patching is in order…
Task: Implement the modifications listed below and think about why they could be necessary (make short notes).
Solution:
# your notes
352 https://github.com/dezhin/osmread
in gtfspy/osm_tranfer.py:
• replace (line 91)
network_nodes = walk_network.nodes(data="true")
by
network_nodes = walk_network.nodes(data=True)
• replace
walk_network.add_path(way.nodes)
by
networkx.add_path(walk_network, way.nodes)
• replace the node cleanup loop (original lines not shown here) by
nodes_to_remove = []
good_nodes = networkx.get_node_attributes(walk_network, 'lat').keys()
for node, degree in walk_network.degree():
if degree == 0:
nodes_to_remove.append(node)
elif node not in good_nodes:
nodes_to_remove.append(node)
for node in nodes_to_remove:
walk_network.remove_node(node)
(good_nodes contains all nodes with lat/lon data; nodes without data presumably belong to ways crossing the
map’s border (some nodes dropped by Osmosis, but way not shortened); prevents index errors when computing
edge lengths some lines below)
in gtfspy/networks.py:
• replace (lines 267-270):
events_df.drop('to_seq', 1, inplace=True)
events_df.drop('shape_id', 1, inplace=True)
events_df.drop('duration', 1, inplace=True)
events_df.drop('route_id', 1, inplace=True)
by
events_df.drop('to_seq', axis=1, inplace=True)
events_df.drop('shape_id', axis=1, inplace=True)
events_df.drop('duration', axis=1, inplace=True)
events_df.drop('route_id', axis=1, inplace=True)
(newer Pandas versions require the axis argument to be passed by keyword)
To speed up routing gtfspy stores all data in an SQLite353 data base. That's a regular file with extension .sqlite. The first step in working with gtfspy is to create the data base containing all relevant GTFS feeds.
Task: Have a look at the import_gtfs function in gtfspy’s import_gtfs module. Use this function to transfer GTFS feeds of interest to you to an SQLite data base.
Solution:
# your solution
If imported GTFS data covers a much larger region than the region you are interested in, you should filter the created
data base by region. Else, routing becomes too expensive (in terms of computation time). The gtfspy package
provides such filtering, but it’s expensive, too. Thus, filtering should only be used if it reduces the data base’s size
significantly.
Filtering requires three steps:
1. Open the data base to filter by creating a GTFS object, defined in gtfspy’s gtfs module.
2. Create a FilterExtract object, defined in gtfspy’s filter module.
3. Call the FilterExtract object’s filter method.
Task: Have a look at gtfspy’s source to learn how to use the above-mentioned objects and functions. Then filter the
data base by region (hint: ‘buffer zone’ in gtfspy's source is the region of interest).
Solution:
# your solution
To get more realistic walking times between neighboring stops we may extract walking distances from OpenStreetMap. This step is optional. It requires a lot of memory and computation time, because the whole walk network (all walkable paths and streets) is extracted from the OSM file. Use OSM walking distances for small regions only. Without OSM data Euclidean distances are used.
Task: Have a look at add_walk_distances_to_db_python in gtfspy’s osm_transfer module. Then use this function to get OSM walking distances. If your region is too large, have a look at the hint below this task.
Solution:
# your solution
353 https://www.sqlite.org
Hint: Without OSM walking distances the routing algorithm will complain about the missing key d_walk in a dictionary. That’s presumably a bug. Workaround: whenever you use your data base (without OSM distances) for routing, add the following lines to your code:
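A plausible reconstruction of the workaround (the edge attribute names 'd' and 'd_walk' are assumptions inferred from the error message; the toy graph only stands in for the real walk network):

```python
import networkx

# toy walk network; in the project this object comes from get_walk_network
walk_network = networkx.Graph()
walk_network.add_edge(1, 2, d=120.0)   # 'd': Euclidean distance in meters
walk_network.add_edge(2, 3, d=80.0)

# copy the Euclidean distance into the key the routing algorithm expects
for source, target, data in walk_network.edges(data=True):
    data["d_walk"] = data["d"]
```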
Here walk_network is an object representing the walk network stored in the data base. It will be created as
preparative step for routing and then passed to the routing algorithm. Place the code between creation of the walk
network and passing the walk network to the routing algorithm.
If you use these two lines of code with OSM distances in the data base, the OSM distances will be overwritten with Euclidean distances.
To use the SQLite data base we have to create a GTFS object, defined in gtfspy’s gtfs module. This object then provides lots of methods for accessing the data.
Task: Have a look at a GTFS object’s stops, get_min_date, and get_max_date methods. Call them to get a list of all stops and the date range covered by the GTFS data.
Solution:
# your solution
In this project we generate departure times for all stops in a region of interest for connections to one arrival stop with
fixed (latest) arrival time.
The project uses the gtfspy data base created in the Get Data and Set Up the Environment (page 317) project.
Basic Pandas knowledge is required to solve the tasks (read Series (page 188), Data Frames (page 198), Advanced
Indexing (page 207) before you start, Performance Issues (page 233) may be of interest, too).
Task: Connect to the data base, that is, create a gtfspy.gtfs.GTFS object.
Solution:
# your solution
The routing algorithm of gtfspy looks for public transport connections in a user-defined time frame. Start and end
time have to be provided in Unix time354 .
Task: Compute Unix times for start and end of your time frame of interest. Use the GTFS object’s get_day_start_ut method to convert a date to its 00:00 Unix time. Then add hours and minutes to this value.
Hint: The Python standard library provides functions for getting Unix times. But GTFS.get_day_start_ut
takes care of time zone information in the GTFS data.
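As a plain-Python cross-check (date and times here are made up; this ignores the feed's time zone handling that get_day_start_ut provides):

```python
from datetime import datetime, timezone

# Unix time for 2023-05-01 08:30 UTC, computed with the standard library
day_start = datetime(2023, 5, 1, tzinfo=timezone.utc).timestamp()
start_ut = day_start + 8 * 3600 + 30 * 60
print(int(start_ut))  # → 1682929800
```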
Solution:
354 https://en.wikipedia.org/wiki/Unix_time
# your solution
The routing algorithm of gtfspy computes public transport connections from all stops in the data base to a user-defined arrival stop. The arrival stop has to be specified by its GTFS ID (column 'stop_I' in the data frame returned by GTFS.stops()).
Task: Get the stops data frame. Use column 'stop_I' (GTFS stop ID) as index. Rename the index column
to 'id' and the column 'stop_id' to 'code' (the stop’s GTFS short name). Drop all columns but 'id',
'code', 'name', 'lat', 'lon'.
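A sketch with a hypothetical two-row stand-in for the frame returned by GTFS.stops() (column set and values are made up except for those named in the task):

```python
import pandas as pd

# hypothetical stand-in for the data frame returned by GTFS.stops()
stops = pd.DataFrame({
    "stop_I": [1, 2],
    "stop_id": ["ZW1", "ZW2"],
    "name": ["Zwickau, Zentrum", "Zwickau, Hbf"],
    "lat": [50.718, 50.726],
    "lon": [12.494, 12.481],
    "parent_I": [-1, -1],  # one of the columns to drop
})

stops = (
    stops.set_index("stop_I")
    .rename_axis("id")                      # index column 'stop_I' -> 'id'
    .rename(columns={"stop_id": "code"})    # the stop's GTFS short name
    [["code", "name", "lat", "lon"]]        # keep only these columns
)
```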
Solution:
# your solution
Task: Write some code to find all stops whose name contains a given string (e.g., all stops containing 'Zwickau, Zentrum'). Use the stops’ geolocation and OpenStreetMap to decide on an arrival stop.
Hint: An advanced and very convenient solution is to generate, for each relevant stop, a link to OSM (with a marker at the stop). Rendering these links as HTML in Jupyter, you simply have to click a stop’s link to see where it is on the map.
• OSM link with marker: https://www.osm.org/?mlat=MARKER_LAT&mlon=MARKER_LON
• HTML rendering for links:
from IPython.display import display, HTML
display(HTML('<a href="URL">LINK_TEXT</a>'))
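Combining both hints, a sketch (stop names and coordinates are made up; rendering the links additionally requires IPython.display inside Jupyter):

```python
import pandas as pd

stops = pd.DataFrame({
    "name": ["Zwickau, Zentrum", "Zwickau, Hbf", "Leipzig, Markt"],
    "lat": [50.718, 50.726, 51.340],
    "lon": [12.494, 12.481, 12.375],
})

# all stops whose name contains the search string
hits = stops[stops["name"].str.contains("Zwickau, Zentrum")]

# build one OSM link (with marker) per matching stop
links = [
    f'<a href="https://www.osm.org/?mlat={row.lat}&mlon={row.lon}">{row.name}</a>'
    for row in hits.itertuples()
]
# in Jupyter: display(HTML("<br>".join(links)))
```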
Solution:
# your solution
27.2.3 Routing
The routing API of gtfspy is relatively complex and unintuitive. To generate all connections to the arrival stop the following steps are necessary:
1. Call gtfspy.routing.helpers.get_transit_connections.
2. Call gtfspy.routing.helpers.get_walk_network(G, max_walk).
3. Create a gtfspy.routing.multi_objective_pseudo_connection_scan_profiler.MultiObjectivePseudoCSAProfiler object. Pass the results of steps 1 and 2 to the constructor
(arguments transit_events and walk_network).
4. Call the run method of the object created in step 3.
Task: Follow the above steps. Have a look at gtfspy’s source for available arguments. A good walking speed is 1.5 (presumably meters per second). With track_vehicle_legs and track_time you (presumably) can influence whether connections with fewer transfers or with lower travel time are preferred by the routing algorithm.
Solution:
# your solution
The MultiObjectivePseudoCSAProfiler object now contains information about all connections to the
arrival stop in the specified time frame. The stop_profiles member variable is subscriptable with allowed
indices returned by the keys member function. Indices are stop IDs. If i is a stop ID, then stop_profiles[i].
get_final_optimal_labels() returns an iterable object with one item per connection from stop i to the
arrival stop. Each item has a departure_time member containing the departure time of the connection in Unix
time.
Task: Add a column to your stops data frame, which contains the difference between latest allowed arrival time and
latest possible departure time from the considered stop in minutes. For stops without connection to the arrival stop
use -1.
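A sketch of the bookkeeping with made-up numbers (in the project the latest departure times come from stop_profiles[i].get_final_optimal_labels(); here they are hard-coded, with None marking a stop without connection):

```python
import pandas as pd

# hypothetical latest departure times (Unix seconds) per stop; None = no connection
latest_departure = {101: 1682928000, 102: None, 103: 1682926800}
arrival_ut = 1682929800  # latest allowed arrival time

stops = pd.DataFrame({"name": ["A", "B", "C"]}, index=[101, 102, 103])
stops["minutes_before_arrival"] = [
    -1 if latest_departure[i] is None else (arrival_ut - latest_departure[i]) // 60
    for i in stops.index
]
print(stops["minutes_before_arrival"].tolist())  # → [30, -1, 50]
```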
Solution:
# your solution
In the stops data frame most stops appear multiple times, e.g., each platform of a station has its own item in the data frame. For visualization nearby stops should be merged into one stop. The GTFS object’s get_stops_within_distance method yields a data frame of nearby stops. The first argument is the considered stop’s ID, the second argument is the distance in meters.
Task: Think about an algorithm for grouping stops and implement it. Add a column to your stops data frame, which
contains a group ID for each stop. All stops with identical group ID are considered one and the same stop (in the
visualization to create in a follow-up project).
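One possible algorithm is union-find over "nearby" pairs: every pair of stops within the chosen distance is merged into the same group. The pair list below is made up; in the project it would come from get_stops_within_distance:

```python
# union-find: stops connected by a "nearby" relation end up in one group
parent = {}

def find(x):
    """Return the group representative of stop x (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    """Merge the groups of stops x and y."""
    parent[find(x)] = find(y)

stop_ids = [1, 2, 3, 4, 5]
nearby_pairs = [(1, 2), (2, 3), (4, 5)]  # hypothetical "within 100 m" pairs

for a, b in nearby_pairs:
    union(a, b)

# group ID per stop: the representative returned by find
groups = {s: find(s) for s in stop_ids}
print(groups)
```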
Solution:
# your solution
Task: How many stop groups do you have? What’s the largest group? Show all its stops.
Solution:
# your solution
TWENTYEIGHT
CORONA DEATHS
In this project we collect and/or compute death rates before and during the Corona pandemic in Germany. You should
read Dates and Times (page 217) before you start.
We would like to have monthly death rates for as long a period of time as possible, including very recent data.
Task: Download relevant data from Federal Statistical Office (Statistisches Bundesamt)355 in CSV format:
• Destatis, table 12613-0006356
• Destatis, table 12411-0001357
• Destatis, Sonderreihe mit Beiträgen für das Gebiet der ehemaligen DDR, Heft 3358
• Destatis, table 12411-0020359
For each file write a short note on its content.
Solution:
# your notes
Task: Use your favorite spreadsheet tool to compile the following CSV files from the downloaded files:
• inhabitants-yearly.csv with columns year, FRG (inhabitants FRG), GDR (inhabitants GDR, 0
from 1990 on)
• inhabitants-quarterly.csv with columns date, inhabitants
• deaths-monthly.csv with columns year, month (numeric 1…12), men, women
We want to use dates as index for data frames. Numbers of inhabitants are related to precise timestamps (end of year
or quarter). Numbers of deaths are related to periods (month).
Task: Read in the three CSV files. Use DatetimeIndex and PeriodIndex for data frames and series. In the
end you should have two series:
• inhabitants with index date (timestamp of last day in year or quarter),
• deaths with index date (monthly period aligned at last day of month).
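A sketch for the deaths series with inline stand-in data (the real numbers come from the CSV files of the previous task):

```python
import io
import pandas as pd

# stand-in for deaths-monthly.csv
csv = io.StringIO("year,month,men,women\n2020,1,40000,42000\n2020,2,38000,40000\n")
df = pd.read_csv(csv)

# monthly periods as index, total deaths as values
dates = pd.to_datetime(df[["year", "month"]].assign(day=1))
deaths = pd.Series((df["men"] + df["women"]).to_numpy(),
                   index=pd.PeriodIndex(dates.dt.to_period("M")), name="deaths")
print(deaths.loc["2020-01"])  # → 82000
```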
355 https://www.destatis.de
356 https://www-genesis.destatis.de/genesis//online?operation=table&code=12613-0006
357 https://www-genesis.destatis.de/genesis//online?operation=table&code=12411-0001
358 https://www.statistischebibliothek.de/mir/servlets/MCRFileNodeServlet/DEMonografie_derivate_00000961/Heft_3.pdf
359 https://www-genesis.destatis.de/genesis//online?operation=table&code=12411-0020
Data Science and Artificial Intelligence for Undergraduates
Solution:
# your solution
For calculating monthly death rates we need the number of inhabitants on a monthly basis, that is, the mean number of inhabitants per month. If we had daily values for the number of inhabitants, we could simply calculate the mean. But the resolution is much coarser. Thus, we have to use (linear) interpolation. A good replacement for the monthly mean is the (interpolated) value at the 15th of the month.
Task: Use resampling to get interpolated number of inhabitants at the 15th of each month. From these values
construct a series with period index (in analogy to the deaths series’ index). Hint: instead of (integer) index based
linear interpolation you may want to use timestamp based interpolation (see docs).
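A sketch with two made-up year-end values (the real series spans many years):

```python
import pandas as pd

# year-end inhabitant counts (millions, made up)
inhabitants = pd.Series([83.0, 83.2],
                        index=pd.to_datetime(["2020-12-31", "2021-12-31"]))

# upsample to daily resolution and interpolate along the time axis
daily = inhabitants.resample("D").mean().interpolate(method="time")

# value at the 15th of each month, re-indexed by monthly periods
mid = daily[daily.index.day == 15]
monthly = pd.Series(mid.to_numpy(), index=mid.index.to_period("M"))
print(len(monthly))  # → 12
```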
Solution:
# your solution
Task: Calculate monthly death rates and plot results with Series.plot().
Solution:
# your solution
Mathematics
CHAPTER
TWENTYNINE
LOGIC
29.1.1 Not
The not operator inverts its operand’s truth value.
𝑎 not 𝑎
true false
false true
29.1.2 And
The and operator yields true if and only if both operands are true.
𝑎 𝑏 𝑎 and 𝑏
true true true
true false false
false true false
false false false
29.1.3 Or (inclusive)
The or operator yields true if and only if at least one of both operands is true. This is sometimes called inclusive or
because the and-case (both operands true) is included (that is, yields true).
𝑎 𝑏 𝑎 or 𝑏
true true true
true false true
false true true
false false false
29.1.4 Or (exclusive)
The xor operator yields true if and only if exactly one of both operands is true. This is called exclusive or because the and-case is excluded.
𝑎 𝑏 𝑎 xor 𝑏
true true false
true false true
false true true
false false false
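These tables correspond directly to Python’s Boolean operators (on bool values the ^ operator acts as exclusive or):

```python
# and / or / xor on Python booleans, matching the truth tables above
print(True and False)   # → False
print(True or False)    # → True
print(True ^ True)      # → False  (exclusive or: and-case excluded)
print(True ^ False)     # → True
```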
THIRTY
COMBINATORICS
Combinatorics is the mathematical field of counting. Applications include discrete distributions in elementary probability theory.
30.1 Factorial
For 𝑛 ∈ ℕ the factorial 𝑛! is defined as

$$n! := 1 \cdot 2 \cdot \,\cdots\, \cdot (n-1) \cdot n,$$

where 0! ∶= 1.
Obviously, 𝑛! = (𝑛 − 1)! 𝑛.
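In Python the factorial is available in the standard library:

```python
import math

print(math.factorial(5))   # → 120
print(math.factorial(0))   # → 1

# the recurrence n! = (n-1)! * n
assert math.factorial(6) == math.factorial(5) * 6
```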
THIRTYONE
LINEAR ALGEBRA
This chapter summarizes tools and results from linear algebra used throughout the book.
• Vectors (page 333)
• Matrices (page 334)
• Systems of Linear Equations (page 336)
31.1 Vectors
The term vector is used in several different contexts, each coming with a slightly different definition. In basic linear
algebra a vector often is considered a finite column of numbers. That’s the approach we follow here.
31.1.1 Definition
For 𝑑 ∈ ℕ a 𝑑-tuple is an ordered list of 𝑑 real numbers, typically written as (𝑥1 , 𝑥2 , … , 𝑥𝑑 ) with 𝑥1 , … , 𝑥𝑑 denoting
the numbers. Here ‘ordered’ means that swapping two unequal numbers in the list yields a different 𝑑-tuple. Example:
(1, 2, 3) ≠ (1, 3, 2). By ℝ𝑑 we denote the set of 𝑑-tuples.
Vector is another term for 𝑑-tuple. In linear algebra vectors may be interpreted as points in space or as the difference between two points (that is, describing a translation). Vectors often are written as columns:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}.$$
Sums of vectors are defined componentwise. For 𝑥, 𝑦 ∈ ℝ𝑑 we have

$$x + y := \begin{bmatrix} x_1 + y_1 \\ \vdots \\ x_d + y_d \end{bmatrix}.$$
Products of real numbers and vectors are defined componentwise. For 𝑎 ∈ ℝ and 𝑥 ∈ ℝ𝑑 we have
$$a\,x := \begin{bmatrix} a\,x_1 \\ \vdots \\ a\,x_d \end{bmatrix}.$$
The inner product of two vectors 𝑥, 𝑦 ∈ ℝ𝑑 is defined as

$$\langle x, y \rangle := x_1\,y_1 + \cdots + x_d\,y_d.$$
For 𝑑 = 3 the outer product (cross product) of 𝑥 and 𝑦 is defined as

$$x \times y := \begin{bmatrix} x_2\,y_3 - x_3\,y_2 \\ x_3\,y_1 - x_1\,y_3 \\ x_1\,y_2 - x_2\,y_1 \end{bmatrix}.$$
The outer product yields a vector orthogonal to both factors.
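In NumPy these operations read as follows (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(x + y)           # componentwise sum → [5. 7. 9.]
print(2 * x)           # scalar multiple  → [2. 4. 6.]
print(np.dot(x, y))    # inner product    → 32.0
cp = np.cross(x, y)    # outer (cross) product
print(cp)              # → [-3.  6. -3.]

# the cross product is orthogonal to both factors
assert np.dot(cp, x) == 0.0 and np.dot(cp, y) == 0.0
```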
31.2 Matrices
A matrix is a rectangular scheme of numbers. Matrices frequently appear in almost all fields of mathematics because they can be used to represent abstract concepts like linear mappings numerically. Many abstract operations in mathematics boil down to matrix computations as soon as concrete numerical examples are considered.
31.2.1 Definition
For 𝑚 ∈ ℕ and 𝑛 ∈ ℕ a matrix is an 𝑚-tuple of 𝑛-tuples (see Vectors (page 333) for definition of tuples). Example:
$$\begin{pmatrix} 1 & 4 & 5 & -2 \\ 3 & 2 & 7 & 10 \\ -2 & 4 & 5 & 8 \end{pmatrix} \quad\text{or}\quad \begin{bmatrix} 1 & 4 & 5 & -2 \\ 3 & 2 & 7 & 10 \\ -2 & 4 & 5 & 8 \end{bmatrix}$$
The set of all real-valued matrices with 𝑚 rows and 𝑛 columns is denoted as ℝ𝑚×𝑛 .
Matrices usually are denoted by uppercase letters and a matrix’s elements by corresponding lowercase letters with a double index. The first index denotes the row, the second the column of the element. Example:

$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{m,1} & \cdots & a_{m,n} \end{bmatrix}.$$

Row 𝑖 is denoted by 𝑎𝑖,• , column 𝑗 by 𝑎•,𝑗 . Rows and columns can be regarded as vectors.
A matrix with identical number of rows and columns (𝑚 = 𝑛) is called square matrix. The tuple (𝑎11 , 𝑎22 , … , 𝑎𝑚𝑚 )
is called main diagonal of the matrix 𝐴.
A square matrix of the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \ddots & & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & & \ddots & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}$$

is called the identity matrix.
Matrices of the form
$$\begin{pmatrix} * & \cdots & \cdots & * \\ 0 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & * \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} * & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ * & \cdots & \cdots & * \end{pmatrix}$$
are called upper-triangular and lower-triangular.
31.2.3 Transpose

The transpose of a matrix 𝐴 ∈ ℝ𝑚×𝑛 is the matrix 𝐴T ∈ ℝ𝑛×𝑚 with elements

$$(A^\mathrm{T})_{j,i} := a_{i,j},$$

that is, the same matrix as 𝐴, but with rows and columns interchanged.

Transposing twice yields the original matrix:

$$(A^\mathrm{T})^\mathrm{T} = A.$$
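In NumPy the transpose is available via the T attribute:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(A.T.shape)            # → (3, 2)
assert (A.T.T == A).all()   # double transpose gives A back
```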
The product of two matrices 𝐴 ∈ ℝ𝑚×𝑛 and 𝐵 ∈ ℝ𝑛×𝑝 is the matrix 𝐶 ∶= 𝐴 𝐵 ∈ ℝ𝑚×𝑝 with entries

$$c_{ik} := \sum_{j=1}^{n} a_{ij}\,b_{jk}.$$
A square matrix 𝐴 ∈ ℝ𝑛×𝑛 is called invertible if there is a square matrix 𝐵 ∈ ℝ𝑛×𝑛 such that

$$A\,B = I \quad\text{and}\quad B\,A = I,$$

where 𝐼 is the identity matrix. The matrix 𝐵 then is called the inverse of 𝐴.
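Matrix product and inverse in NumPy (example matrices are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[2.0, 0.0],
              [1.0, 2.0]])

C = A @ B                      # matrix product
print(C)                       # → [[ 4.  4.]  [10.  8.]]

Ainv = np.linalg.inv(A)        # inverse of A
assert np.allclose(A @ Ainv, np.eye(2))
assert np.allclose(Ainv @ A, np.eye(2))
```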
31.2.6 Determinants
The determinant det 𝐴 of a square matrix 𝐴 ∈ ℝ𝑛×𝑛 is the real number computed from the following iterative rule:

$$\det A := \begin{cases} a_{1,1}, & \text{if } n = 1, \\ \displaystyle\sum_{j=1}^{n} (-1)^{1+j}\, a_{1,j} \det A_{1,j}, & \text{if } n > 1, \end{cases}$$
where 𝐴1,𝑗 is the submatrix of 𝐴 originating from removing row 1 and column 𝑗.
The determinant is non-zero if and only if the matrix is invertible. It is positive if the matrix columns constitute a
right-handed coordinate system and negative if columns constitute a left-handed coordinate system.
For each fixed 𝑖 ∈ {1, … , 𝑛} we get an equivalent iterative definition:

$$\det A = \sum_{j=1}^{n} (-1)^{i+j}\, a_{i,j} \det A_{i,j} \quad\text{for } n > 1.$$
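For a 2×2 matrix the expansion reduces to the familiar formula, which NumPy confirms:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# 2x2 case of the cofactor expansion: a11*a22 - a12*a21
assert np.isclose(np.linalg.det(A), 1.0 * 4.0 - 2.0 * 3.0)

# the determinant is non-zero, so A is invertible
assert np.linalg.det(A) != 0
```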
31.3 Systems of Linear Equations

A system of linear equations with 𝑚 equations and 𝑛 unknowns 𝑥1 , … , 𝑥𝑛 has the form

$$\begin{aligned} a_{1,1}\,x_1 + \cdots + a_{1,n}\,x_n &= b_1 \\ &\,\,\vdots \\ a_{m,1}\,x_1 + \cdots + a_{m,n}\,x_n &= b_m. \end{aligned}$$

With the coefficient matrix 𝐴 ∈ ℝ𝑚×𝑛 , the vector of unknowns 𝑥 ∈ ℝ𝑛 and the right-hand side 𝑏 ∈ ℝ𝑚 it can be written compactly as

$$A\,x = b.$$
Depending on the coefficient matrix 𝐴 and the right-hand side 𝑏 a system of linear equations has either no solution, exactly one solution, or infinitely many solutions.
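For the uniquely solvable case NumPy solves such systems directly (example system is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 4.0])

x = np.linalg.solve(A, b)  # unique solution, since det(A) != 0
assert np.allclose(A @ x, b)
print(x)  # → [1. 1.]
```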
Software Development
CHAPTER
THIRTYTWO
There’s a standard for visualizing the design of software and other systems: the Unified Modeling Language (UML). Among many other types of visualization it standardizes how to express relations between classes graphically. In particular, inheritance relations can be visualized. We do not go into the details here. But you should know that the standard exists, and from time to time you should practice reading UML class diagrams, since they are used for planning and communicating larger software projects.
To get an overview of UML class diagrams have a look at Class diagram360 at Wikipedia.
An open source tool for drawing UML class diagrams is UMLet361 .
Other types of diagrams are shown in Wikipedia’s article Unified Modeling Language362 .
360 https://en.wikipedia.org/wiki/Class_diagram
361 https://www.umlet.com
362 https://en.wikipedia.org/wiki/Unified_Modeling_Language