Data Science Ai
Jens Flemming
2 Guided Reading 15
2.1 Data Science I (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Data Science II (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Data Science III (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Data Science IV (course at WHZ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II Warm-Up 21
3 Data Science, AI, Machine Learning 23
3.1 Science With and Of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Example: Customer Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Example: Weather Forecast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1 Names and Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.2 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.3 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.4 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
9 Strings 105
9.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
9.2 Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.3 String Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11 Functions 127
11.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
11.2 Passing Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
11.3 Anonymous Functions (Lambdas) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.4 Function and Method Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
11.5 Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
14 Inheritance 143
14.1 Principles of Object-Oriented Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14.2 Idea and Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
14.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
14.4 Type Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
14.5 Every Class is a Subclass of object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
14.6 Virtual Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.7 Multiple Inheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
14.8 Exceptions Inherit from Exception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.4 The set Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.5 Function Decorators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.6 The copy Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
15.7 Multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
15.8 Graphical User Interfaces (GUIs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
V Exercises 237
19 Computer Basics 239
19.1 Bits and Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
19.2 Representation of Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
19.3 Memory vs. Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
19.4 Compilers and Interpreters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
21.3 Pandas Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
21.4 Pandas Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
21.5 Advanced Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
21.6 Pandas Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
VI Projects 283
22 Install and Use Python 285
22.1 Working with JupyterLab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
22.2 Install Jupyter Locally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
22.3 Python Without Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
22.4 Long-Running Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
24 Weather 305
24.1 DWD Open Data Portal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
24.2 Getting Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
24.3 Climate Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
26 Cafeteria 315
26.1 The API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
26.2 Legal Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
26.3 Getting Raw Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
26.4 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
30 Combinatorics 331
30.1 Factorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
VIII Software Development 337
32 Unified Modeling Language (UML) 339
Data Science and Artificial Intelligence for Undergraduates
This book covers a wide range of topics in data science and artificial intelligence. It’s an attempt to provide self-contained learning material for first-year students in data science related courses. Most, but not all, of the material is taught in the undergraduate course on data science1 at Zwickau University of Applied Sciences2.
When he started teaching data science in 2019, the author3 faced the problem that there seemed to be no textbook covering math, computer science, statistical data science, artificial intelligence, and related topics in a well-structured, accessible, thorough way. Basic Python4 programming should be covered as well as state-of-the-art deep reinforcement learning for controlling autonomous robots, all with hands-on experience for students, interesting real-world data sets, and a sufficiently rich theoretical background.
Classical paper books or PDF ebooks do not suit the needs of this project. Working with data requires lots of source code, interactive visualizations, data listings, and easy-to-follow pointers to online resources. Jupyter Book5 is an awesome software tool for publishing book-like interactive content. For the author, writing this book is also a journey of discovery into the possible future of publishing. Having authored two paper books, he knows the tight limits of paper books and publishing companies, and all the greater is his enthusiasm for the freedom in writing and publishing provided by Jupyter Book and its community, The Executable Books Project6.
The author expresses his gratitude to all the more or less anonymous people developing the wonderful open source tools used in and for writing this book. There are too many tools to list them here. The author also thanks his students and colleagues at Zwickau University, especially Hendrik Weiß, who constantly find typos and make suggestions for improving the book.
Jens Flemming7, Zwickau, December 2022
1 https://datascience.fh-zwickau.de
2 https://www.fh-zwickau.de
3 https://www.fh-zwickau.de/~jef19jdw
4 https://www.python.org
5 https://jupyterbook.org
6 https://executablebooks.org
7 https://www.fh-zwickau.de/~jef19jdw
Part I
CHAPTER ONE
AN EXECUTABLE BOOK
This book provides several online features for executing presented Python code and for contributing content. This
chapter contains all information you need for using these features.
• Read Online or Offline (page 5)
• Manipulate and Execute Code (page 6)
• Contribute (page 9)
This book has been created with Jupyter Book8 . It comes in different formats and variants. The reader may choose
according to her or his preferences. The reader may even switch between different variants at will.
The intended medium for reading this book is a website in a web browser, that is, the HTML rendering of the book9. There you have full functionality, including interactive features and live code editing and execution; see Manipulate and Execute Code (page 6) for details.
The HTML rendering comes with an optional fullscreen mode (button in the upper right corner). You may also hide
the table of contents sidebar on small screens (button in the upper left corner).
Fig. 1.1: Fullscreen button and sidebar button allow you to adjust the book’s layout to your screen.
8 https://jupyterbook.org
9 https://www.fh-zwickau.de/~jef19jdw/data-science-ai
You may download the whole HTML rendering as a ZIP archive. After extracting the archive, open the file index.html in a web browser. All features will work like in the online version, but some content, like externally hosted videos, won’t be available without an internet connection.
The HTML rendering is a static website, that is, no web server is needed for reading. Everything works on your local machine.
Download HTML rendering in ZIP file10
For printing and for friends of higher quality typesetting there is a PDF version of the book. Of course, the PDF
version lacks interactive features like in-place code editing and execution.
Download PDF ebook11
Python code in this book can be executed in different ways without copying the code manually. The HTML rendering’s
upper right corner shows a rocket symbol. The rocket button provides several options for executing a page’s code.
Fig. 1.2: Hovering over the rocket symbol provides several options for code execution.
The next section contains some Python code for testing code execution right here on this page. Subsequent sections
describe button functionality in more detail. Local code execution on your machine is described, too.
Attention: All code execution features but Live Code use the book’s Jupyter12 rendering. For technical reasons, the Jupyter rendering lacks some figures, and text formatting may be incorrect. For reading without a need for code execution, stay with the HTML or PDF renderings.
10 https://www.fh-zwickau.de/~jef19jdw/data-science-ai/data-science-ai.zip
11 https://www.fh-zwickau.de/~jef19jdw/data-science-ai/data-science-ai.pdf
12 https://jupyter.org
Here we have some simple Python code for testing code execution features of this executable book. Details on these
features are given below.
a = 2
b = 6
print(a, '+', b, '=', a + b)
2 + 6 = 8
The Binder launch button opens a JupyterLab13 session on mybinder.org14. There you find the book’s Jupyter rendering. The Jupyter rendering is a collection of Jupyter Notebooks (files with ipynb extension). The Binder launch button opens your current page’s Jupyter rendering, but all other pages are available, too, in one and the same Binder session.
Fig. 1.3: Binder startup requires cloning the book’s Git repository if something has changed since last Binder usage.
Starting the Binder session may take some seconds. Keep in mind that mybinder.org15 is a free service provided by volunteers and supported by donors. Don’t overuse it, so it stays free and available to everybody. Don’t run complex computations like neural network training on Binder.
The JupyterLab session on Binder allows for code editing and repeated execution. You may also save your files there,
but they will be lost as soon as you end the session. Don’t forget to download modified files to your local machine
before you leave.
13 https://jupyter.org
14 https://mybinder.org
15 https://mybinder.org
Gauss16 is a GPU server at Zwickau University of Applied Sciences, available only in the university’s intranet.
Students with access to Gauss should use Gauss instead of Binder. The Gauss launch button runs the book’s Jupyter
rendering in JupyterLab on Gauss very similar to Binder.
Fig. 1.4: Gauss asks for username and password before launching a book’s page in JupyterLab.
A click on the Gauss launch button copies the whole GitLab repository17 of the book to the user’s personal directory on
Gauss. Thus, modifications to code and other files are saved to the user’s directory, too, and are persistent. Repeated
clicks on the Gauss launch button do not overwrite a user’s modifications, but may update files untouched by the user
but modified in the GitLab repository. Thus, the user’s version will always be up-to-date while preserving the user’s
modifications as far as possible. For details on the merge process run when clicking the Gauss launch button see
Automatic Merging Behavior18 in nbgitpuller’s documentation.
The Live Code button makes code cells editable and executable on-the-spot using Thebe19 . Clicking the Live Code
button starts a Python kernel on mybinder.org20 and connects the book’s HTML rendering to that kernel. Progress
and success of the startup process are shown below the page’s heading.
Fig. 1.5: A box with progress information appears after clicking the Live Code button.
Code cells on the page change their appearance. Outputs now belong to the cell, some buttons appear, and the code
becomes editable.
Cells are not run immediately after clicking the Live Code button. Clicking the ‘run’ button executes the cell. Alternatively, one may run all cells on a page by clicking the ‘restart & run all’ button.
16 https://gauss.fh-zwickau.de
17 https://gitlab.hrz.tu-chemnitz.de/jef19jdw--fh-zwickau.de/data-science-ai
18 https://jupyterhub.github.io/nbgitpuller/topic/automatic-merging.html
19 https://github.com/executablebooks/thebe
20 https://mybinder.org
Fig. 1.6: After launching Live Code each code cell shows buttons for starting and controlling code execution.
To execute the book’s Python code on your local machine, download a page’s Jupyter rendering by clicking the download button in the upper right corner of the HTML rendering, or clone the book’s Git repository21 to your machine.
Fig. 1.7: Hovering over the download symbol shows a list of available formats.
On your machine you need JupyterLab22 or a similar tool from the Jupyter ecosystem to view and modify the ipynb
files. For install instructions have a look at the Install Jupyter Locally (page 289) project.
1.3 Contribute
This book is not static. It will grow over time, and existing material will be rearranged, updated, reduced, or extended as required by future developments in data science, AI, and teaching. In this sense, it’s a living or dynamic book. And you, the reader, are invited to contribute to the book.
The book’s HTML rendering shows a contribution button in the upper right corner offering several options. Those
options will be discussed in detail below.
Fig. 1.8: Hovering over the contribution button shows all options for contributing to the book.
21 https://gitlab.hrz.tu-chemnitz.de/jef19jdw--fh-zwickau.de/data-science-ai
22 https://jupyter.org
The repository button is a simple link to the book’s GitLab repository. There you find all source files needed to render
the book. If you are familiar with Git and GitLab the repository button is a good starting point for contributing to
the project. If you are not familiar with Git and GitLab, don’t worry and use one of the other contribution options.
Fig. 1.9: The repository button leads to a page in GitLab with information about the project, including information
about the most recent update.
Important: The book’s public Git repository is hosted on a GitLab instance provided by Chemnitz University
of Technology23 for all Saxon universities. Actions requiring a user account are restricted to members of Saxon
universities.
The open issue button allows you to ask questions or report bugs, typos, and so on. Clicking this button opens a new issue in the book’s GitLab repository. Put your question, bug report, or whatever else in the description field and click the ‘Create issue’ button.
The author will have a look at the issue as soon as possible and post an answer. Depending on your GitLab account’s configuration, you will receive email notifications if someone posts a comment on your issue. Each reader may comment on other readers’ issues, so readers may help each other solve problems where appropriate.
If you spot a typo or if you would like to add some explanatory note or an additional code example you should hit
the ‘suggest edit’ button. The button opens the current page for editing in GitLab. The syntax of the text file is
MyST markdown24 and should be self-explanatory. If your edit contains more than a typo correction, leave a commit
message summarizing your edit. Then click ‘Commit changes’.
On the next page click ‘Create merge request’. This will send a notification to the author that somebody suggested an
edit.
The author will have a look at your suggestion as soon as possible. GitLab’s merge request feature allows for discussing
and modifying the edit if necessary before the author merges the edit into the book’s source and, thus, into the
published book.
23 https://www.tu-chemnitz.de
24 https://myst-parser.readthedocs.io
Fig. 1.10: To open an issue simply type a description and click on ‘Create issue’. The title field is prefilled and should
remain untouched.
Fig. 1.11: Edit the page’s markdown source and leave a commit message to describe more elaborate edits.
Fig. 1.12: Simply click ‘Create merge request’. Fill the description field if you feel a need for additional explanation
next to your commit message.
If you don’t have an account at the book’s GitLab instance feel free to send issues and suggestions for edits to the
author25 by email.
25 https://www.fh-zwickau.de/~jef19jdw
CHAPTER TWO
GUIDED READING
The teaching material in this book, including exercises and projects, may be arranged in different ways to meet the
needs of a comprehensive lecture series or to accompany a one-day workshop on machine learning, for instance.
In each section the chapter presents a selection of material together with hints on working through it. Students of
Zwickau University’s data science course will find the material for each semester here.
• Data Science I (course at WHZ) (page 15)
• Data Science II (course at WHZ) (page 20)
• Data Science III (course at WHZ) (page 20)
• Data Science IV (course at WHZ) (page 20)
The first part of the data science lecture series introduces the Python programming language and some Python libraries
required for data processing. Next to Python the focus is on working with big data, obtaining, understanding and
restructuring data, as well as extracting basic statistical information from data.
2.1.1 Warm-Up
Week 1
Lectures
• Data Science, AI, Machine Learning (page 23)
• Python and Jupyter (page 27)
• Computers and Programming (page 33)
Self-study
• Computer Basics (page 239) (exercises)
• Install and Use Python (page 285)
– Working with JupyterLab (page 285) (project)
Practice session
• Install and Use Python (page 285)
– Install Jupyter Locally (page 289) (project)
Lectures
• Crash Course (page 43)
– A First Python Program (page 43)
– Building Blocks (page 46)
Self-study
• Finding Errors (page 243) (exercises)
• Basics (page 249) (exercises)
Practice session
• Install and Use Python (page 285)
– Python Without Jupyter (page 292) (project)
• Python Programming (page 297)
– Simple List Algorithms (page 297) (project)
Lectures
• Crash Course (page 43), continued
– Screen IO (page 56)
– Library Code (page 57)
– Everything is an Object (page 59)
Self-study
• More Basics (page 251) (exercises, last task is bonus)
Practice session
• Python Programming (page 297)
– Geometric Objects (page 300) (project, last section is bonus)
Lectures
• Variables and Operators (page 75)
– Names and Objects (page 75)
– Types (page 80)
– Operators (page 85) (section Operators as Member Functions (page 88) is bonus)
– Efficiency (page 89) (all but section Garbage Collection (page 91) is bonus)
Self-study
• Variables and Operators (page 253) (exercises)
• Memory Management (page 256) (exercises, all but the last two tasks are considered bonus)
Practice session
Lectures
• Lists and Friends (page 93)
– Tuples (page 93)
– Lists (page 95)
– Dictionaries (page 98)
– Iterable Objects (page 98)
• Strings (page 105)
– Basics (page 105)
– Special Characters (page 107)
– String Formatting (page 109)
Self-study
• Lists and Friends (page 259) (exercises)
• Strings (page 262) (exercises, last one is bonus)
Practice session
• MNIST Character Recognition (page 311)
– The xMNIST Family of Data Sets (page 311) (project)
Lectures
• Accessing Data (page 111)
– File IO (page 111)
– Text Files (page 114)
– ZIP Files (page 117)
– CSV Files (page 118)
– HTML Files (page 119)
– XML Files (page 121)
– Web Access (page 122)
Self-study
• File Access (page 264) (exercises)
Practice session
• Cafeteria (page 315), download part (project)
Lectures
• Functions (page 127)
– Basics (page 127)
– Passing Arguments (page 128)
– Anonymous Functions (Lambdas) (page 132)
– Function and Method Objects (page 132)
– Recursion (page 133)
• Modules and Packages (page 135)
Self-study
• Functions (page 265) (exercises)
Practice session
• Cafeteria (page 315), parsing part (project)
Lectures
• Error Handling and Debugging Overview (page 139)
• Inheritance (page 143) (last section Exceptions Inherit from Exception (page 147) is bonus)
• Unified Modeling Language (UML) (page 339) (bonus)
Self-study
• Object-Oriented Programming (page 267) (exercises)
• Further Python Features (page 149) (bonus reading)
Practice session
• Weather (page 305)
– Getting Forecasts (page 307), download part (project)
Lectures
• Efficient Computations with NumPy (page 155)
– NumPy Arrays (page 155)
– Array Operations (page 161)
– Advanced Indexing (page 165)
– Vectorization (page 166)
Self-study
• NumPy Basics (page 269) (exercises)
Practice session
• Weather (page 305)
– Getting Forecasts (page 307), parsing part (project, automatic download is bonus)
Lectures
• Efficient Computations with NumPy (page 155), continued
– Array Manipulation Functions (page 168)
– Copies and Views (page 171)
– Efficiency Considerations (page 173)
– Special Floats (page 175)
– Linear Algebra Functions (page 177)
– Random Numbers (page 179)
• Saving and Loading Non-Standard Data (page 181)
Self-study
• Image Processing with NumPy (page 271) (exercises, last one is bonus)
Practice session
• MNIST Character Recognition (page 311)
– Load QMNIST (page 313) (project)
Lectures
• High-Level Data Management with Pandas (page 187)
– Series (page 188)
– Data Frames (page 198)
Self-study
• Pandas Basics (page 273) (exercises)
Practice session
• Public Transport (page 317)
– Get Data and Set Up the Environment (page 317) (project)
Lectures
• High-Level Data Management with Pandas (page 187), continued
– Advanced Indexing (page 207)
– Dates and Times (page 217)
Self-study
• Pandas Indexing (page 276) (exercises)
Practice session
• Corona Deaths (page 325) (project)
Lectures
• High-Level Data Management with Pandas (page 187), continued
– Categorical Data (page 223)
– Restructuring Data (page 227)
– Performance Issues (page 233)
Self-study
• Advanced Pandas (page 278) (exercises)
• Pandas Vectorization (page 281) (exercises)
Practice session
• Weather (page 305)
– Climate Change (page 309) (project)
The second semester of the data science lecture series starts with visualization techniques. Then supervised machine learning for generating predictions from data is introduced. Linear regression and artificial neural networks are discussed in depth.
to be continued…
Part three of the data science lecture series continues discussion of supervised machine learning. Further methods
like decision trees and support vector machines are introduced. Then we move on to unsupervised machine learning
covering clustering methods and techniques for dimensionality reduction.
to be continued…
The last part of the data science lecture series is devoted to reinforcement learning. Next to very basic techniques we
also discuss state-of-the-art deep reinforcement learning with artificial neural networks.
to be continued…
Part II
Warm-Up
CHAPTER THREE
DATA SCIENCE, AI, MACHINE LEARNING
Data Science comes in different flavors and sometimes denotes different things. Some clarification on the terms used
in this book and on the subjects covered is mandatory.
With the advent of cheap storage devices in the last decade of the 20th century, companies, governments, other organizations, and also private individuals started collecting data at large scale (big data). In a world full of data, somebody has to think about how to make accessible the information hidden in data. Computer scientists and mathematicians developed a bunch of methods for extracting information, more and more applications popped up, methods became more complex… a new field of research was born. This new field matured, got the name ‘data science’, and now is accepted as a serious field of research and teaching.
Data Science as a science field covers all technical aspects of data processing. There’s large overlap with computer sci-
ence and mathematics, but also with many other fields, depending on where data comes from. Mathematics provides
advanced methods for extracting information from data. Computer science allows for their realization.
Data Science also touches law, ethics and sociology. May I use this data set for my project? Is it okay to collect and
dig through personal data? What impact will extensive data collection and processing have on society?
Almost every data science project has four phases:
1. Collect Data
Data has to be recorded and stored somehow. Planning and realizing data collection processes is referred to as data engineering. Typical tasks in this phase are, for instance, installing and configuring sensors, setting up database storage, and implementing techniques for supervising data flow.
2. Clean and Restructure Data
Raw data sets often contain errors, missing items, or false items. They have to be cleaned. Almost always, several data sets have to be combined to allow for successful extraction of information. These preprocessing steps require lots of manual work and domain knowledge. Careful preprocessing will simplify subsequent processing steps and is at least as important as the modeling phase.
3. Create a Model
From recorded and preprocessed data a mathematical or algorithmic model is build. Depending on the concrete
problem to solve from data, such a model may describe the data set (descriptive model) or it may be used to answer
some question based on the data set (predictive model).
4. Communicate Results
Findings from the data have to be communicated to the client. Visualizations are the most important tool for delivering
results.
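The cleaning step of phase 2 can be sketched in a few lines of pandas. This is only a minimal sketch with made-up values and column names; in a real project, domain knowledge is needed to decide what counts as a false item (here, 999.0 is assumed to be a sensor error code):

```python
import numpy as np
import pandas as pd

# Made-up raw data with a missing item (NaN) and a false item (999.0 error code).
raw = pd.DataFrame({
    'temperature': [20.1, np.nan, 19.8, 999.0],
    'humidity': [0.45, 0.50, np.nan, 0.48],
})

# Mark obviously false items as missing, then drop incomplete rows.
cleaned = raw.replace(999.0, np.nan).dropna()
print(cleaned)
```

Whether to drop incomplete rows or to fill the gaps (for instance, by interpolation) is itself a modeling decision that depends on the data set at hand.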
In this book we focus on preprocessing and modeling. Data engineering and communication of results will be touched only occasionally. The visualization aspect of communication also plays an important role in preprocessing when exploring a new data set (exploratory data analysis, EDA for short). So we will cover the full range of visualization tools and techniques there.
Brick-and-mortar stores as well as online shops collect as much customer data as they can to understand customer
behavior. Knowing how many people buy which products at which time in which quantities is essential for efficient
warehousing. But customer data is also used for targeted ad campaigns.
For targeted advertising one tries to identify groups of customers with similar behavior. For each group tailor-made
ads are created. Customer segmentation is an example of descriptive modeling. The aim is to understand the collected
data and to find structures not obvious at first glance.
Typical tasks in the four phases described above are:
1. Collect Data
• implement a network infrastructure to collect sales data from all stores in a central database
• issue customer cards to know who comes to your shop (age, gender, location,…)
• think about buying external data about your customers (Schufa,…)
• check legal situation to know whether you are allowed to collect the data you want
2. Clean and Restructure Data
• throw away all the data not relevant for segmentation (for instance, data of customers not living in the targeted
region)
• transform data (for instance, convert absolute quantities to relative quantities: milk made 5% of the shopping
cart)
• restructure data to get per-customer data instead of per-shop or per-product
3. Create a Model
• apply some standard segmentation method
• try to understand the identified customer groups, find unique characteristics
4. Communicate Results
• present groups and their unique characteristics to the advertising department
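The transformation and restructuring steps of phase 2 can be sketched with pandas. This is a minimal sketch with made-up transaction data; all column names and values are illustrative, not taken from a real shop:

```python
import pandas as pd

# Made-up per-purchase transaction data; column names are illustrative only.
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'product': ['milk', 'bread', 'milk', 'milk', 'butter'],
    'quantity': [2, 1, 1, 3, 1],
})

# Restructure: one row per customer, one column per product (per-customer data).
per_customer = transactions.pivot_table(index='customer_id', columns='product',
                                        values='quantity', aggfunc='sum', fill_value=0)

# Transform absolute quantities to relative ones (shares of the shopping cart).
shares = per_customer.div(per_customer.sum(axis=1), axis=0)
print(shares)
```

The resulting per-customer table of cart shares is the kind of input a standard segmentation method would work on in phase 3.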
Weather forecasting is a typical example of predictive modeling. From past data we want to create a model which yields information on future weather parameters. In the past, lots of experts analyzed recorded weather data and made predictions mainly from experience and from classical mathematical and physical modeling. Data science allows us to automate the forecasting process. Instead of handcrafted models and expert knowledge, one creates a predictive data model based on all (or sufficiently many) recorded weather data.
1. Collect Data
• decide which weather parameters to record (temperature, humidity,…)
• implement a network infrastructure to collect weather data from across the world
• build and launch satellites
• build terrestrial weather stations
2. Clean and Restructure Data
• decide for a subset of data to use for forecasting (for instance, only use data from past 30 days)
• transform data (for instance, harmonize temperature units: Fahrenheit, Celsius)
• restructure data (for instance, downsample data from 5-minute periods to hourly values)
3. Create a Model
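The transformation and restructuring steps listed in phase 2 can be sketched with pandas. This is a minimal sketch with made-up 5-minute temperature readings; all values and names are illustrative:

```python
import pandas as pd

# Made-up 5-minute temperature readings in Fahrenheit (two hours of data).
index = pd.date_range('2022-12-01 00:00', periods=24, freq='5min')
fahrenheit = pd.Series([32.0 + i for i in range(24)], index=index)

# Harmonize temperature units: Fahrenheit to Celsius.
celsius = (fahrenheit - 32.0) * 5.0 / 9.0

# Downsample from 5-minute periods to hourly mean values.
hourly = celsius.resample('1h').mean()
print(hourly)
```

Restricting to a subset of the data (for instance, only the past 30 days) would be a simple slice on the time index of such a series.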
Artificial intelligence is to some extent a buzzword. It’s used for computer programs doing things we consider intelligent. Examples are image classification (what is shown in the image?), language processing (translate a text), and autonomous driving (orient and move in a complex environment). Under the hood there’s still a classical computer program, no intelligence.
Most, if not all, methods related to artificial intelligence are based on processing large data sets. Image and language
processing systems are trained on large data sets of sample images and sample texts. Autonomous driving uses
reinforcement learning, which can be understood as collecting large amounts of data while exploiting information
extracted from previously collected data (data collection on demand). In this sense, artificial intelligence is a subfield
of data science. In this book we also cover this vague field of artificial intelligence, including reinforcement learning.
There’s also a strict mathematical definition of artificial intelligence: A computer system is intelligent if it passes the
Turing test. In the Turing test a human chats with another human and the computer system in parallel. If the human
cannot decide which of both chat partners is human, the computer passed the test. Up to now no computer passed
the Turing tests. If interested, have a look at Wikipedia’s article on the Turing test26 .
By machine learning we denote the process of writing computer programs that ‘learn’ to do something from data. In
other words, we set up a model with lots of unknowns and then fit the model to our data. So machine learning refers
to a style of software development. We do not write a program line by line. Instead we use a general purpose program
and fill in the details automatically based on some data set.
Machine learning may be regarded as the hard core of data science and artificial intelligence, the part which contains
all the mathematics.
26 https://en.wikipedia.org/wiki/Turing_test
27 https://xkcd.com/1838
Fig. 3.1: The pile gets soaked with data and starts to get mushy over time, so it’s technically recurrent. Source:
Randall Munroe, xkcd.com/183827
FOUR
In this book we use the Python28 programming language for talking to the computer. Tools from the Jupyter29
ecosystem allow for Python programming in a very comfortable graphical environment.
There are lots of software tools for data science and artificial intelligence. They can be divided into two groups:
Tailor-made GUI tools
For common tasks in data science and AI, like clustering data or classifying images, there exist (mostly commercial)
tools with a graphical user interface (GUI). Such tools are easy to use, but they have a very limited scope of application.
Each task requires a different tool. Available methods are restricted to well-known ones. Implementing new problem-specific
methods is not possible.
General Purpose Tools
To enjoy maximum freedom in choice of methods one has to leave the world of GUI tools. Creating data science
models (that is, computer programs) without any restrictions requires the use of some high-level programming lan-
guage. Examples are R30 and Python31 . Both languages are very common in the data science community because
they ship with lots of extensions for simple usage in data science and AI.
Tailor-made tools come and go as time moves on. Programming languages are much more long-lasting. In this book
we stick to the Python programming language and its ecosystem. The R programming language would be a good
alternative, but it leans more toward statistical tasks than toward general purpose programming.
Tip: Some people feel frightened if someone says ‘programming language’. Think of programming languages as
usual software tools. The only difference is that they provide much more functionality than GUI tools. But there’s
not enough space on screen to have a button for each function. So we write text commands.
Python is a modern, free and open source programming language. It dates back to the early 1990s with a first official
release in 1994. Its father and BDFL (benevolent dictator for life) is Guido van Rossum32 .
Python code is very readable and straightforward, without too many cumbersome symbols like in most other pro-
gramming languages. Many technical aspects of computer programming are managed by Python instead of by the
programmer. With Python one may develop the full range of software, from simple scripts to fully featured web or
desktop applications. Thousands of extensions allow for rapid development.
28 https://www.python.org
29 https://jupyter.org
30 https://www.r-project.org/
31 https://www.python.org
32 https://en.wikipedia.org/wiki/Guido_van_Rossum
33 https://xkcd.com/353
Data Science and Artificial Intelligence for Undergraduates
Fig. 4.1: I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I’m leaving you. Source: Randall
Munroe, xkcd.com/35333
There’s a large online community discussing Python topics. Almost every problem you’ll encounter has already been
solved. Simply use a search engine to find the answer.
Fig. 4.2: Popularity of programming languages on Stack Overflow34 . Source: Stack Overflow Trends35 (modified
by the author)
Some rules followed by Python and its community are collected in the Zen of Python36 . Here are some of them:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
• Complex is better than complicated.
• Readability counts.
• There should be one – and preferably only one – obvious way to do it.
• Although that way may not be obvious at first unless you’re Dutch.
• If the implementation is hard to explain, it’s a bad idea.
• If the implementation is easy to explain, it may be a good idea.
Last but not least, Python is available on all platforms: Linux, macOS, Windows, and many more. Large parts of
YouTube are written in Python and many other tech giants use Python. But it’s also not unlikely that a Python script
controls your washing machine.
Hint: There are two versions of Python: Python 2 and Python 3. Source code is not compatible, that is, there are
programs written in Python 2 which cannot be executed by a Python 3 interpreter. In this book we stick to Python
3. Python 2 is considered deprecated since January 202037 .
34 https://stackoverflow.com
35 https://insights.stackoverflow.com/trends?tags=python%2Cjavascript%2Cjava%2Cc%23%2Cphp%2Cc%2B%2B%2Cr
36 https://en.wikipedia.org/wiki/Zen_of_Python
37 https://www.python.org/doc/sunset-python-2/
4.3 Jupyter
The Jupyter ecosystem is a collection of tools for Python programming with emphasis on data science. Jupyter allows
for Python programming in a web browser. Outputs, including complex and interactive visualizations, can be put right
below the code producing these outputs. Everything is in one document: code, outputs, text, images,…
Fig. 4.3: JupyterLab is the most widely used member of the Jupyter ecosystem. It brings Python to the web browser.
In this book you’ll meet at least four members of the Jupyter ecosystem.
JupyterLab38
JupyterLab is a web application bringing Python programming to the browser. It’s the everyday tool for data science.
JupyterLab may run on a remote server (cloud) or on your local machine.
Jupyter Notebook39
An alternative to JupyterLab is Jupyter Notebook. It’s a predecessor of JupyterLab and provides almost identical
functionality, but with a different look and feel.
JupyterHub40
Running JupyterLab in the cloud requires user authentication and user management. JupyterHub provides everything
we need to run several JupyterLabs on a server in parallel. Almost all JupyterLab providers (e.g., Gauss at Zwickau
University, Binder) rely on JupyterHub.
Jupyter Book41
This book is being published using Jupyter Book. Each page is a Jupyter notebook file. Jupyter Book provides
automatic generation of tables of contents, bibliography handling, and rendering to different output formats.
38 https://jupyter.org
39 https://jupyter.org
40 https://jupyter.org/hub
41 https://jupyterbook.org
Work through the following projects to get up and running with Python and Jupyter:
• Working with JupyterLab (page 285)
• Install Jupyter Locally (page 289)
• Python Without Jupyter (page 292)
FIVE
Computers are the main tool for data science and artificial intelligence. In this chapter we answer some basic questions:
• What is a computer?
• What are bits, bytes, kilobytes,…?
• What is software?
• What is programming and what are programming languages?
Although we don’t have to know detailed answers to these questions, we should have some rudimentary understanding
of what happens inside a computer.
Related exercises: Computer Basics (page 239).
Each modern computer consists of three components: central processing unit (CPU), memory, input/output (IO)
devices. These components are connected by many wires, which are organized together with some auxiliary stuff on
the computer’s mainboard.
Fig. 5.1: There’s a tight and fast data connection between CPU and memory. IO devices are connected to the CPU
and in some cases also directly to memory.
IO devices are all parts of the computer which provide an interface to humans, like screen, keyboard, printer, and
scanner. But mass storage devices (hard disk drives, SSDs, DVD drives, card readers and so on) are IO devices, too.
Another kind of IO device are network adapters for Ethernet, Wi-Fi, Bluetooth and others. The common feature of all IO
devices is that they produce and/or consume streams of binary data. ‘Binary’ means that there are only two different
values, usually denoted by 0 and 1. Electrically, 0 might stand for low voltage and 1 for high voltage.
Memory can store streams of binary data. In some sense it is similar to mass storage IO devices, but it is used in a
very different way. Most storage devices are very slow and access times for reading and writing data depend on the
position of the data on the device. In contrast, memory access is very fast and access times are independent of the
data’s concrete location. Whenever data has to be stored for a short time only, memory is used. Due to technological
reasons memory loses all data when power is turned off, whereas data on mass storage devices persists.
The CPU is a highly integrated circuit which processes streams of binary data. ‘Processing’ means that incoming data
from memory and/or IO devices is transformed and then sent to memory and/or IO devices. If a binary stream from
memory is interpreted as instructions by the CPU, then we say that the stream contains code. Data in the stricter
sense refers to parts of binary streams that are processed by the CPU, but which do not tell the CPU what to do.
Memory can contain code and data, whereas IO devices only produce non-code data.
A bit is a piece of binary information. It either holds a one or a zero. Less information than a bit is no information.
With a sequence of 𝑘 bits we can express 2ᵏ different values, for example the numbers 0, 1, …, 2ᵏ − 1.
Fig. 5.2: With 3 bits we may represent 8 different values. Each additional bit doubles the number of possible values.
By convention binary data in modern computers is organized in groups of 8 bits. A sequence of 8 bits is denoted as
a byte.
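The relation between 𝑘 bits and 2ᵏ values can be checked in Python by enumerating all bit patterns (a small sketch, not from the original text):

```python
import itertools

k = 3
# All possible sequences of k bits, each bit being 0 or 1.
patterns = list(itertools.product([0, 1], repeat=k))

print(len(patterns))  # 8, which equals 2 ** 3
```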
Following the metric system, there are kilobytes (1000 bytes), megabytes (1000 kilobytes), gigabytes (1000 megabytes),
and so on with prefixes tera, peta, exa, zetta, yotta. Corresponding symbols are kB or KB, MB, GB, TB, PB, EB, ZB,
YB.
In some hardware-oriented fields of computer science it is common practice to use the factor 1024 = 2¹⁰ instead of
1000. Thus, the size of a kilobyte may be 1000 or 1024 bytes. As a rule of thumb, 1000 is used for data transmission
and 1024 is used for memory and storage related things (except in ads for storage devices, because 1024 would
give a lower number of gigabytes). Sometimes the prefixes kibi, mebi, gibi, tebi, pebi, exbi, zebi, yobi are used with
corresponding symbols KiB, MiB, GiB, TiB, PiB, EiB, ZiB, YiB for factor 1024. One kibibyte, for instance, has
1024 bytes.
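Both conventions are easy to compare in Python (a small sketch; the ‘500 GB’ drive is an invented example):

```python
# Decimal prefixes: 1 kB = 1000 bytes. Binary prefixes: 1 KiB = 1024 bytes.
kB, KiB = 1000, 1024
GB, GiB = 1000 ** 3, 1024 ** 3

# A drive advertised as '500 GB' (decimal) holds fewer gibibytes:
print(500 * GB / GiB)  # roughly 465.66
```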
Important: Computers only work with binary data. Everything has to be represented as sequences of zeros and
ones. For integers, like 123, this is quite simple (see below). Rational numbers, like 0.123, may be represented by
two integers, a numerator 123 and a denominator 1000 for instance. But what about text data? Or images?
Data which cannot be represented as sequence of zeros and ones cannot be processed by a computer. We’ll
come back to this representation issue several times in this book.
Numbers may have a name, like one, two, three, four, five, six, seven, eight, nine, ten. There are even more named
numbers: eleven, twelve and zero, for instance. Obviously, not all numbers can have an individual name. We need a
system for automatically naming numbers and also for writing them down. At this point it is important to distinguish
between numbers, which can be used for counting and computations, and their representation in spoken and written
language.
In everyday life we use the decimal system based on 10 different digit symbols because we have 10 fingers. An
octopus surely would invent a numbering system with only 8 digits. The Maya civilization used a 20-digit system
(fingers plus toes). Computers would invent number systems based on 2 digits, because they are representable
by 1 bit, or on 4 digits (2 bits), 8 digits (3 bits), or 16 digits (4 bits).
42 https://en.wikipedia.org/wiki/Dresden_Codex
43 https://commons.wikimedia.org/wiki/File:Maya_Hieroglyphs_Plate_32.jpg
Fig. 5.3: Maya numerals on a page of a Maya book known as Dresden Codex42 . Source43 : Sylvanus Morley via
Wikimedia Commons, modified by the author.
There are many systems for writing down (and naming) numbers. Today the most widely used ones are positional.
An example of a non-positional system is Roman numerals.
Fix a number 𝑏, the basis, and take 𝑏 symbols to denote the first 𝑏 numbers. Here, we interpret zero as the first number,
followed by one as the second number and so on. In case 𝑏 is at most ten, we may use the symbols 0, 1, … , 9 for the
numbers from zero to nine. Every number 𝑐 has a unique representation of the form
𝑐 = 𝑎ₙ𝑏ⁿ + ⋯ + 𝑎₂𝑏² + 𝑎₁𝑏¹ + 𝑎₀𝑏⁰,
where 𝑎₀, 𝑎₁, …, 𝑎ₙ ∈ {0, 1, …, 𝑏 − 1} are the digits and 𝑛 + 1 is the number of digits required to express the number 𝑐
with respect to the basis 𝑏. With this unique representation at hand, we may write the number 𝑐 as a list of its digits:
𝑐 = 𝑎ₙ … 𝑎₀. Keep in mind that the basis 𝑏 has to be known to interpret a list of digits, although 𝑏 often is not written
down explicitly.
Example: If we take, for instance, the number twelve, then with ten as base 𝑏 we would have
twelve = 1 ⋅ 𝑏¹ + 2 ⋅ 𝑏⁰ = 12.
Numbers given in base ten are denoted as decimal numbers. More exactly, one should say ‘a number in decimal
representation’ since the number itself does not care about how we write it down.
To avoid confusion, each number we write down without explicitly specifying a basis is to be understood as a decimal
number. Numbers in a basis other than ten always will come with some hint on the basis.
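The digit expansion 𝑎ₙ … 𝑎₀ can be computed by repeated division by the basis 𝑏. A small sketch in Python (the helper function is invented for illustration):

```python
def digits(c, b):
    """Return the digits of c with respect to basis b, most significant first."""
    result = []
    while c > 0:
        c, digit = divmod(c, b)  # divmod returns quotient and remainder
        result.append(digit)
    return result[::-1] or [0]  # reverse; zero has the single digit 0

print(digits(12, 10))  # [1, 2]
print(digits(12, 2))   # [1, 1, 0, 0]
```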
Numbers in positional representation with base 2 are called binary numbers. They frequently appear in computer
engineering. Symbols for the two digits are 0, 1 and sometimes the letters O, I.
Example: Number twelve in binary representation is
12 = 1 ⋅ 2³ + 1 ⋅ 2² + 0 ⋅ 2¹ + 0 ⋅ 2⁰ = 1100 (binary).
Base 8 yields octal numbers. For octal numbers the usual digits 0 to 7 can be used.
Example:
12 = 1 ⋅ 8¹ + 4 ⋅ 8⁰ = 14 (octal).
Octal numbers occur, for instance, in file access permissions on Unix-like systems, because access is controlled by 3 sets
(owner, group, all) of 3 bits (read, write, execute). Thus, all possible combinations can be conveniently expressed by
three-digit octal numbers.
Example: Access right 750 (which is 111 101 000 in binary notation) says that the file’s owner may read, write and
execute the file. The owner’s group is not allowed to write to the file (only read and execute). All other users do not
have any access right.
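Python understands octal literals directly, which makes such permission values easy to inspect (a quick sketch, not from the original text):

```python
mode = 0o750  # octal literal: owner rwx, group r-x, others ---

print(bin(mode))           # 0b111101000, the 3 x 3 permission bits
print(mode & 0o200 != 0)   # may the owner write? True
print(mode & 0o020 != 0)   # may the group write? False
```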
Base 16 yields hexadecimal numbers. For hexadecimal numbers we use 0 to 9 followed by the symbols a, b, c, d, e,
f to denote the digits.
Examples:
12 = 12 ⋅ 16⁰ = c (hexadecimal),
125 = 7 ⋅ 16¹ + 13 ⋅ 16⁰ = 7d (hexadecimal).
Note that letters a to f might be digits of a hexadecimal number as well as variable names. Have a look at the context
to get the correct meaning. Sometimes capital letters A, B, C, D, E, F are used.
Hexadecimal numbers occur in many different situations because the range 0 to 255 of a byte value maps exactly to
the set of all two-digit hexadecimal numbers: 00 to ff. We will meet this notation when specifying colors.
Example: The color value ff c0 60 yields a light orange (100% red intensity, 75% green intensity, 38% blue intensity).
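The mapping between hexadecimal byte values and intensities can be verified with Python’s int function, which accepts a basis as its second argument (a small sketch):

```python
# The three bytes of the color value ff c0 60, converted from hexadecimal.
red, green, blue = int("ff", 16), int("c0", 16), int("60", 16)

print(red, green, blue)  # 255 192 96
# Intensities relative to the maximum byte value 255:
print(round(red / 255 * 100), round(green / 255 * 100), round(blue / 255 * 100))
```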
Fig. 5.4: Professional graphics programs show hexadecimal color values, often denoted as ‘HTML notation’, because
hexadecimal color values frequently occur in HTML44 code for websites.
Software is a stream of binary data to be read and processed by the CPU. The task of a software developer is to
generate streams of binary data which make the CPU do what the software developer wants it to do.
Hint: ‘Binary data’ has at least two different meanings, depending on the context.
• In programming contexts, where we have to distinguish between computer and human readable data, data is
considered ‘binary’ if it has no useful interpretation as text.
• In more general contexts, data is considered ‘binary’ if it is or can be represented as a sequence of zeros and
ones. In this sense, a picture is not binary data, but a digital copy consisting of pixels instead of brushstrokes
is binary data.
Modern software has a size of several megabytes or even gigabytes. It is practically impossible for humans to generate
such large and complex amounts of binary data by hand. Instead, the process of software development has been auto-
mated step by step, beginning from scratch with directly coding zeros and ones in the 1950s up to today’s high-level
programming languages.
44 https://en.wikipedia.org/wiki/HTML
5.4.1 Assemblers
A first step of automation was the invention of assemblers. These are computer programs which transform a
set of, to some extent, human readable codewords into a sequence of zeros and ones processable by a CPU. Here is an
example:
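The original listing is not reproduced here; a hypothetical x86-style reconstruction matching the description below could look like this (register names and mnemonics are assumptions, not the book’s original code):

```asm
mov eax, [120]   ; read 4 bytes from memory address 120 into register eax
mov ebx, [124]   ; read 4 bytes from memory address 124 into register ebx
add eax, ebx     ; add both values, result stays in eax
mov [128], eax   ; write the result to memory address 128
```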
The first line tells the CPU to read 4 bytes from memory address 120 and to store them in one of its registers (a kind
of internal memory). The second line does the same, but with memory address 124 and a second CPU register. Then the
CPU is told to add both values. The CPU stores the result in its eax register. The last line makes the CPU write the
result of the addition to memory address 128.
Writing computer programs in assembler code made software development much easier. But due to the very limited
instruction set, reflecting one-to-one the instruction set of the CPU, programs are hard to read and tightly bound
to the hardware they were designed for. The only advantages of assembler code compared to modern programming
languages are its speed of execution and its small size after transformation to binary code. The first initialization routine
of modern operating systems is still written in assembler code, because it has to fit into a small predefined portion of
a storage device called the boot sector.
A further step in the evolution of programming languages are languages for structured programming. Examples are C,
BASIC, and Pascal. Here the hardware is almost completely abstracted away, and a relatively complex program, the compiler,
is needed to transform the source code written by the software developer into binary code for the CPU. Here is a
snippet of a C program:
int a, b;
a = 5;
b = 10 * a + 7;
printf("result is %i", b);
The first line tells the compiler that we need two places in memory for storing integer values. The second line makes
the CPU move the value 5 to the place in memory referenced by a. The third line makes the CPU do some calculations
and store the result in memory referenced by b. Finally, the result shall be printed on screen. Writing this in assembler
code would require some hundred lines of code and we would have to take care of memory organization (where is
free space?) and of the instruction set of the CPU. Both are done by the compiler. The C language in particular is still of
great importance. It is used, for example, for large parts of Linux and Windows.
A further layer of abstraction is object-oriented programming. Instead of handling hundreds of variables and hundreds
of functions for their processing, everything is organized in a well structured way reflecting the structure of the real
world. Examples of programming languages allowing for object-oriented programming are C++, Java, Python.
Source code of a computer program either is compiled or interpreted. Compiling means that the source code is
translated to binary code and after finishing this translation it can be executed, that is, fed to the CPU. Interpretation
means that the source code is translated line by line and each translated line is sent immediately to the processor.
Compiled programs run much faster than interpreted ones. But interpreted programs allow for simpler debugging
and more intuitive elements in the programming language. Sometimes interpreted programs are called scripts and
corresponding languages are denoted as scripting languages. C and C++ are compiled languages whereas Python is
interpreted. Java is somewhere in between.
CHAPTER
SIX
CRASH COURSE
This chapter provides an overview of everyday Python features and their basic usage.
• A First Python Program (page 43)
• Building Blocks (page 46)
• Screen IO (page 56)
• Library Code (page 57)
• Everything is an Object (page 59)
Related exercises:
• Finding Errors (page 243)
• Basics (page 249)
• Simple List Algorithms (page 297)
• More Basics (page 251)
• Geometric Objects (page 300)
We start the Python crash course with a small program. The program will ask the user for some keyboard input. If
the user types bye the program stops, else it asks again.
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:")
    code = input()  # wait for user input
    if code == "":
        print("To lazy to type?")
        print("")
print("Bye")
The first line
code = None
provides space in memory to store something (this is quite imprecise, but will be made precise soon). We name this
piece of memory code since we want to store a code typed by the user. At the moment there is nothing to store,
which is expressed by None.
The second line
while code != "bye":
is the head of a loop. The subsequent indented code block is executed again and again as long as the condition code
!= "bye" is satisfied. Here, != means unequal. Thus the line
print("Bye")
is not executed before the variable code holds the value bye.
The line
code = input()  # wait for user input
waits for user input, which then is stored in the variable code. Text following a # symbol is ignored by the Python
interpreter.
The lines
if code == "":
    print("To lazy to type?")
    print("")
check whether the user entered an empty string (that is, simply hit the enter key) and, if so, print a reminder followed
by an empty line.
The line
print("Bye")
is only executed if the user types bye. Then, after executing this, the program stops. There is no other way for the
user to stop the program (apart from killing it with the operating system’s help).
• Contrary to most other programming languages, code indentation matters. Subblocks of code have to be
indented to be recognized as such by the Python interpreter. Usual indentation width is 4 spaces, but other
widths and also tabs may be used as long as indentation is consistent throughout the file.
• There is something we call a function. Functions in our script are print and input. We can pass arguments
to functions to influence their behavior (what to print?) and functions may return some value. The latter is the
case for the input function. The return value is stored in the variable code. Note that even if we don’t want
to pass arguments to a function, we have to write parentheses: input().
6.1.4 Errors
The Python interpreter detects syntax errors, that is, code not following Python’s grammar. If, for instance, we forget
a closing parenthesis, we get an error message:
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:"
    code = input()  # wait for user input
    if code == "":
        print("To lazy to type?")
        print("")
print("Bye")
Input In [2]
    print("I'm a Python program. Type 'bye' to stop me:"
         ^
SyntaxError: '(' was never closed
The next code example contains a semantic error not detectable by the interpreter: instead of == there is != in the if
statement, which checks for inequality instead of equality. Thus, the program will print the To lazy to type?
message if the user has typed something. No message will appear if the user has provided no input. That’s not what
we want the program to do.
code = None
while code != "bye":
    print("I'm a Python program. Type 'bye' to stop me:")
    code = input()  # wait for user input
    if code != "":
        print("To lazy to type?")
        print("")
print("Bye")
Important: Writing code is not too hard. But writing code without errors is almost impossible. Finding and
correcting errors is an essential task and will consume considerable resources (time and nerves). Some advice for
finding errors:
• Always read and understand the error message (if there is one).
• If you do not understand an error message, ask a search engine for explanation.
• Line numbers shown in an error message often are correct, but sometimes the erroneous code is located above
(not below) the indicated position.
• Test run your code as often as possible. Write some lines, then test these lines. The less code you write between
tests, the more obvious the reason for an error.
• For each error there is a solution. You just have to find it.
• The Python interpreter is always right! If it says that there is an error, then THERE IS AN ERROR.
We start our quick run through Python with essential features which can be found in almost every high-level program-
ming language. What we will meet here is known as structured programming. Later on we will move on to object
oriented programming.
6.2.1 Comments
A Python source code file may contain text ignored by the Python interpreter. Such comments help to understand and
document the source code. Everything following a # symbol is ignored by the interpreter.
# this line is ignored by the interpreter
a = 1  # everything after the # symbol is ignored, too
b = 2
Note that empty lines do not matter. Like comments, empty lines may be placed anywhere to make the code more
readable.
6.2.2 Assignments
Data (numbers, strings, and other) in memory can be associated with a human readable string. Such a combination of a
piece of data and a name for it is known as a variable. To assign a name to a piece of data Python uses the = sign.
a = 1
The above code writes the number 1 to some location in memory and assigns the name a to it. Whenever we want
to use or modify this value we simply have to provide its name a. The Python interpreter translates the name into a
memory address.
print(a)
1
We have to distinguish different types of data because each type comes with its own set of operations. Numbers can
be added and multiplied, for example, whereas strings can be concatenated but not multiplied. In Python we do not
have to care too much about choosing the correct data type, because the interpreter does much of the technical stuff
(e.g., how much memory is required?) for us.
a = 2 # an integer
b = 2.1 # a floating point number
c = "Hello!" # a string
d = True # a boolean value
print(a)
print(b)
print(c)
print(d)
2
2.1
Hello!
True
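The type of a value can be inspected with the built-in type function; a quick check of the four types from the listing above (added for illustration):

```python
print(type(2))        # <class 'int'>
print(type(2.1))      # <class 'float'>
print(type("Hello"))  # <class 'str'>
print(type(True))     # <class 'bool'>
```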
Integers
Note: In most programming languages there is a maximum value an integer can attain; a 32-bit integer, for instance,
is restricted to the range −2³¹, …, 2³¹ − 1. In Python there is no limit on the size of an integer.
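This can be checked directly, for instance by computing a power far beyond any 64-bit range (a quick check, added for illustration):

```python
# 2 to the power of 100 exceeds every fixed-size integer type, but works fine:
big = 2 ** 100

print(big)      # 1267650600228229401496703205376
print(big + 1)  # exact arithmetic, no overflow
```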
a = 5 + 2   # addition
b = 5 - 2   # subtraction
c = 5 * 2   # multiplication
d = 5 // 2  # floor division
e = 5 % 2   # remainder of division
f = 5 / 2   # division (yields a floating point number)
g = 2 ** 5  # power

print(a)
print(b)
print(c)
print(d)
print(e)
print(f)
print(g)
7
3
10
2
1
2.5
32
a = 1 // 0
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 a = 1 // 0
If we want the user to input an integer, we may use the following code:
a = int(input())
Like print and input, int is a function. It takes a string and converts it to an integer. If this is not possible,
an error message appears and program execution is stopped.
Python supports floating point numbers (also known as floats) in the approximate range 1e-308…1e+308 with 15
significant decimal places (double precision in IEEE 754 standard45 ). Floating point numbers are stored as a pair of
coefficient and exponent of 2, where both coefficient and exponent are integers.
Example: 0.1875 = 3 ⋅ 2⁻⁴ with coefficient 3 and exponent −4.
Important: Most decimal fractions cannot be represented exactly as float, which may cause tiny errors in compu-
tations.
Example:
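The original example listing is not reproduced here; a common minimal demonstration of such tiny errors (an assumption-based illustration, not the book’s original code):

```python
import math

print(0.1 + 0.2)         # 0.30000000000000004, not exactly 0.3
print(0.1 + 0.2 == 0.3)  # False

# Comparisons of floats should therefore allow for tiny errors:
print(math.isclose(0.1 + 0.2, 0.3))  # True
```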
See Python documentation46 for more detailed explanation and additional examples.
45 https://en.wikipedia.org/wiki/IEEE_754
46 https://docs.python.org/3/tutorial/floatingpoint.html
print(a)
print(b)
print(c)
print(d)
5
5.0
5.123
7.123
Note that Python converts data types automatically as needed. The destination type is chosen to prevent loss of data as
far as possible (cf. line 4 in the code example above). If conversion is not possible, the interpreter will complain.
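A small illustration of automatic conversion (added for illustration, not the book’s original listing):

```python
# Adding an integer and a float yields a float, so no information is lost:
a = 5 + 2.5
print(a)        # 7.5
print(type(a))  # <class 'float'>

# Explicit conversion is possible, too, but may lose information:
print(int(7.9))  # 7, the fractional part is cut off
```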
Strings
In Python strings are as simple as numbers. Just enclose some characters in single or double quotation marks and
they will become a Python string.
print(d)
Hello my friend!
Behavior of operators like + depends on the data type of the operands. Adding two integers 123 + 456 yields the
integer 579. Adding two strings '123' + '456' yields the string '123456'.
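A quick check of this behavior (added for illustration):

```python
print(123 + 456)      # 579, integer addition
print("123" + "456")  # 123456, string concatenation

# Mixing both types in + is an error; convert explicitly instead:
print(int("456") + 123)  # 579
```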
If a string contains single quotation marks, then use double quotation marks and vice versa. Alternatively, you may
escape quotation marks in a string with a backslash.
a = "He isn't cool."
b = 'He isn\'t cool.'
c = 'He said: "You are crazy"'
d = "He said: \"You are crazy\""

print(a)
print(b)
print(c)
print(d)
He isn't cool.
He isn't cool.
He said: "You are crazy"
He said: "You are crazy"
Boolean Values
Boolean values or truth values can hold either True or False. Typically, they are the result of comparisons. Boolean
values support logical operations like and, or, and not (see Logic (page 329)).
a = True
b = a and False
c = not a
d = a or b
print(a)
print(b)
print(c)
print(d)
True
False
False
True
6.2.4 Functions
A function is a piece of Python code with a name. To execute the code we have to write its name, optionally followed
by parameters (sometimes denoted as arguments) influencing the function’s code execution. After executing the
function some value can be returned to the caller.
This concept is required in two circumstances:
• a piece of code is needed several times,
• readability shall be increased by hiding some code.
Built-in Functions
Python has several built-in functions, like print and input. The print function takes one or more variables and
prints them on screen. In case of multiple arguments outputs are separated by spaces. The input function may be
called without arguments. It waits for user input and returns the input to the calling code.
a = input()
print('You typed:', a)
Above we also met the int function, which converts a string to an integer if possible. The int function behaves
exactly in the same way as all other functions, but it is not a built-in function in the stricter sense. Instead, it’s the
constructor of a class, a concept we’ll discuss later on.
Keyword Arguments
Functions accept different kinds of arguments. Some are passed as they are (like for print). Those are called
positional arguments and we meet them in almost all programming languages.
In Python often we’ll see function calls of the form some_function(argument_name=passed_value).
Such arguments are called keyword arguments and help to increase code readability. If a function accepts multiple
keyword arguments, we do not have to care about which one to pass first, second and so on. Details will be discussed
in a separate chapter on functions later on.
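The built-in sorted function illustrates this: key and reverse are passed as keyword arguments (a small sketch, added for illustration):

```python
words = ["banana", "fig", "apple"]

# Positional argument 'words', keyword arguments 'key' and 'reverse':
# sort by word length, longest first.
print(sorted(words, key=len, reverse=True))  # ['banana', 'apple', 'fig']
```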
Function Definitions
def say_hello(name):
    ''' Print a hello message. '''
    message = 'Hello ' + name + '!'
    print(message)

say_hello('John')
say_hello('Anna')
Hello John!
Hello Anna!
Note the indentation of the function’s code and the docstring '''...'''. The indentation tells the Python
interpreter which lines of code belong to the function. The docstring is ignored by the interpreter like a comment. But
tools for automatic generation of software documentation extract the docstring and process it.
To return a value (like input does) we would have to add a line containing return my_value. The
return keyword stops execution of the function and returns control to the calling code. We place return wherever
appropriate for our purposes. Often, but not always, it’s in the last line of the function’s code.
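A minimal sketch of a function using return (the function name and values are made up for illustration):

```python
def make_greeting(name):
    ''' Return (rather than print) a hello message. '''
    return 'Hello ' + name + '!'

text = make_greeting('John')   # the returned string is tied to the name text
print(text)
```

Here nothing is printed inside the function; the caller decides what to do with the returned value.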
Important: Variables introduced inside a function, like message above, are only accessible inside that function.
But variables defined outside a function are accessible inside functions, too. It’s considered good practice to keep
inside and outside variables separated. That is, don’t use outside variables inside a function. Instead pass all values
required by the function as arguments and return results required outside a function with return. Exceptions prove
the rule.
Errors in Functions
If there is an error in a function’s code, the Python interpreter will show an error message together with a traceback.
That’s a list of code lines leading to the erroneous line. If a program calls a function which again calls a function
which contains an error, the traceback will have three entries.
In the following example the variable name is incorrect in the print line.
def say_hello(name):
    ''' Print a hello message. '''
    message = 'Hello ' + name + '!'
    print(mesage)

say_hello('John')
say_hello('Anna')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [15], in <cell line: 8>()
4 message = 'Hello ' + name + '!'
5 print(mesage)
----> 8 say_hello('John')
9 say_hello('Anna')
If there were an error in print (because we passed an unexpected argument, for instance), then the traceback
would have an additional entry showing the erroneous line in the definition of print.
Hint: Tracebacks may become very long if your code triggers a problem deep inside some built-in or library function. Check
the traceback carefully for the last entry referring to your own code. That’s the most likely location of the problem’s
cause.
Up to now program flow is linear. There is one path and the interpreter will follow this path. Here comes the first
element of flow control: conditional execution.
if a > b:
    print('First number is greater.')
else:
    print('First number is not greater.')
If the condition is satisfied, then the first code block is executed. If it is not satisfied, the else block is executed.
For equality use ==, for inequality use !=. Other comparison operators are <, >, <=, >=.
A comparison evaluates to a boolean value. Thus, more complex conditions can be constructed with the help of
boolean operators.
if a < 0:
    print('It\'s a negative number.')
elif a == 0:
    print('It\'s zero.')
elif a < 10:
    print('It\'s a small positive number.')
else:
    print('It\'s a large positive number.')
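Conditions combining several comparisons via boolean operators might look like this (the value of a is made up):

```python
a = 7

# 'and' requires both comparisons to be satisfied
if a > 0 and a % 2 == 1:
    print('positive and odd')

# 'not' negates a condition
if not (a < 0):
    print('not negative')
```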
The second element of flow control, next to conditional execution, is repeated execution. Python provides two
techniques: while loops and for loops.
For Loops
for number in range(1, 10):
    print(number ** 2)
1
4
9
16
25
36
49
64
81
Note that 100 is not printed. The loop always stops before the final number is reached.
Note: Whenever you have to define a range of integers in Python, the upper bound has to be the last value you need
plus 1. If you come from some other programming language, this peculiarity of Python takes some getting used to.
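The exclusive upper bound can be sketched as follows:

```python
# range(3, 7) yields 3, 4, 5, 6 -- the upper bound 7 is excluded
for n in range(3, 7):
    print(n)
```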
While Loops
my_number = 10

guess = int(input('Guess my number: '))
while guess != my_number:
    guess = int(input('Wrong guess. Try again: '))

print('Correct!')
No Do-While Loops
Many programming languages have a so-called do-while loop. That’s like a while loop, but the condition is checked
at the loop’s end. Thus, the loop’s code block is executed at least once. Python does not have a do-while loop.
Guido van Rossum, Python’s BDFL47 , rejected a Python enhancement proposal (PEP) which suggested to introduce
a do-while loop48 with the following words49 :
47 https://en.wikipedia.org/wiki/Benevolent_dictator_for_life
48 https://peps.python.org/pep-0315/
49 https://mail.python.org/pipermail/python-ideas/2013-June/021610.html
Please reject the PEP. More variations along these lines won’t make the language more elegant or easier
to learn. They’d just save a few hasty folks some typing while making others who have to read/maintain
their code wonder what it means.
For and while loops provide the keywords break and continue. With break we can abort execution of the
loop. With continue we can stop execution of the loop’s code block and immediately begin the next iteration.
Loops may have an else code block. The else block is executed if iteration terminates regularly. It is skipped, if
iteration is stopped by break.
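A sketch combining break with a loop’s else block (the list of numbers is made up):

```python
numbers = [3, 8, -2, 7]

# search for the first negative number
for n in numbers:
    if n < 0:
        print('Found negative number:', n)
        break
else:
    # runs only if the loop finished without break
    print('No negative number found.')
```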
Note: Whereas for and while loops are available in almost all programming languages, the else block is a special
feature of Python.
6.2.7 Lists
Next to the simple data types above there are more complex ones. Here we restrict our attention to lists. A list can
hold a number of values. The length of a list is returned by the built-in function len. Square brackets [ and ] are
used for defining a list and for accessing single elements of a list.
a = [3, 7, 'hello', 0.1]

print('List:', a)
print('Last element:', a[len(a) - 1])
List: [3, 7, 'hello', 0.1]
Last element: 0.1
List indices start with 0 in Python. Consequently, the last element of an n-element list has index n-1. The above code
to access the last element is considered non-pythonic. Why this is the case and how to make it better will be discussed
later on.
Lists may contain arbitrary types of data. Even lists of lists are allowed. This way we can construct two-dimensional
data structures.
a = [[3, 4, 5], [1, 2], [4, 6]]

print(a[0])
print(a[0][0], a[2][0])
[3, 4, 5]
3 4
a = [3, 4, 5]
print(a)
a[1] = 1000
print(a)
[3, 4, 5]
[3, 1000, 5]
How to append elements to an existing list and many more list related topics will be discussed later on.
Note: A list may have length 0, that is, it may be empty. Empty lists occur frequently because often one wants to
fill lists item by item, starting with an empty list. To get an empty list in Python write [].
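A minimal sketch:

```python
a = []             # an empty list
print('Length:', len(a))
```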
With the above building blocks at hand we may write arbitrarily complex programs. There is nothing more we need.
It’s like Lego blocks50 . Take lots of simple blocks, add some creativity, and think about how to reach your aim step
by step.
All the other features of Python we’ll discuss soon only exist to simplify programming, save some time and make
programs more readable. But they won’t add new possibilities.
To be precise: there is a small number of additional built-in functions we need to know, like open for accessing
files.
Building everything from scratch is a long and winding road. So we’ll use other people’s code and combine it to
new and larger projects. There’s a large library of ready-to-use code snippets, called the Python standard library51 .
For specific tasks like data science and AI there are specialized libraries containing thousands of functions we may use
without implementing them ourselves. Examples are Matplotlib52 , Pandas53 , Scikit-Learn54 and TensorFlow55 .
50 https://en.wikipedia.org/wiki/Lego
51 https://docs.python.org/3/library/
52 https://matplotlib.org/
53 https://pandas.pydata.org/
54 https://scikit-learn.org/stable/
55 https://www.tensorflow.org/
6.3 Screen IO
Input from keyboard and output to screen are important for a program’s interaction with the user. Here we discuss
some frequently used features of terminal and Jupyter based screen IO. Graphical user interfaces (GUIs) are possible
in Python, too, but won’t be discussed here.
6.3.1 Input
The input function does not provide more functionality than we have already seen. It takes one argument, the text
to be printed on screen, and returns a string with the user’s input.
6.3.2 Output
The print function provides some fine-tuning. We already saw that it takes an arbitrary number of arguments.
Each argument which is not a keyword argument will be sent to screen. Outputs of multiple arguments are separated
by one space. Non-string arguments are converted to strings automatically.
Instead of spaces, different separators can be specified with the keyword argument sep='...'.
a = 3
b = 2.34
c = 'some_string'
print(a, b, c, sep=' | ')
3 | 2.34 | some_string
Note: Note the difference to print(a, b, c, ' | '), which yields 3 2.34 some_string | .
The print function automatically adds a line break to the output. If this is not desired we may pass something else
as keyword argument end='...', an empty string for instance.
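A sketch of the end keyword argument:

```python
print('no', end=' ')    # end=' ' replaces the automatic line break by a space
print('line break')     # continues on the same line
```

The two calls together print no line break on a single line.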
Note: The \ character is not printed but interpreted as an escape character. The character following \ is a
command to the output routine. Next to \n we already met \' and \". If you have to print a \ on screen use \\.
Calling print without any arguments prints a line break. Thus, print() and print('') are equivalent.
In JupyterLab and also in the Python interpreter’s interactive mode we do not have to write print(...) every time
we want to see some result. JupyterLab automatically prints the value of the last line in a code cell. The Python
interpreter automatically prints the value of the last command issued. Instead of
print(123 * 456)
56088
we may simply write
123 * 456
56088
While plain Python calls print on the value to output, JupyterLab calls its own function display. For simple data
like numbers and strings display does the same as print, but for complex data like tables or images display
produces richer output. Even audio files and videos may be embedded into Jupyter notebooks by calling display
on suitably prepared data (or leaving this to JupyterLab if the data is produced in the last line of a cell).
Source code libraries contain reusable code. In Python reusing code written by other people is very simple and there
are lots of code libraries available for free. Code libraries for Python are organized in modules and packages.
Next to built-in functions like print and input Python ships with several modules, which can be loaded on demand.
A module is a collection of additional functionality. Everybody can write and publish Python modules. How to do this
will be explained later on. Modules are written either in Python or in some other language, mainly the C programming
language.
Before we can use functionality of a module we have to import it:
import numpy as np
Hint: A number of modules come pre-installed with Python (the Python standard library56 ). But many others have
to be installed separately. Whenever Python shows a ModuleNotFoundError, the module you want to import is
not installed. Install a module via Anaconda Navigator or with conda install module_name in a terminal;
see the Install Jupyter Locally (page 289) project for more details.
The code above imports the module numpy and makes it accessible under the name np, which is shorter than
‘numpy’. NumPy57 is a collection of functions and data types for advanced numerical computations. We will dive
deeper into this module later on. To use NumPy’s functionality we have to write np.some_function with
some_function replaced by one of NumPy’s functions.
np.sin(0.25 * np.pi)
56 https://docs.python.org/3/library/
57 https://numpy.org
0.7071067811865475
Here, np.pi is a floating point variable holding an approximation of 𝜋. The function np.sin computes the sine
of its argument.
Note: The name np for accessing functionality of the module numpy can be chosen freely. But for widely used
modules like NumPy there are standard names everybody should use to improve code readability. These names are
given in a module’s documentation (look at the code examples there). Keep in mind: that’s a convention;
import numpy as wild_cat would be okay, too, from a technical point of view.
A package is a collection of modules. A module from a package can be imported via import package.module.
A very important Python package is Matplotlib58 for scientific plotting:
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [2, 4, 3, 5])
plt.show()
With plt.plot we create a line plot and plt.show displays this plot. If the code runs in JupyterLab the plot is
embedded into the notebook. If run by a plain Python interpreter a window opens showing the plot.
We will come back to Matplotlib when discussing data visualization.
58 https://matplotlib.org/
The Python programming language follows some basic design principles, which make the language very clear and in
some sense aesthetic. One such principle is ‘Everything is an object’.
A Python object is a collection of variables and functions. We may use objects to bring structure into a program’s
variables and functions. Each object is of a certain type. An object’s type specifies a minimum list of variables and
functions an object contains or provides. Variables and functions belonging to an object are called member variables
and member functions. A synonym for member function is method.
Large programs have lots of variables and functions. By object-oriented programming we denote an approach to
structure the morass of variables and functions into objects. Python supports this programming style.
For the moment we look at objects as containers for structuring source code. Later on we will discuss more advanced
features of object-oriented programming like defining hierarchies of types (to untangle the morass of types…).
Example
Think of a Python program which does some geometrical computations and then plots a line and a circle. To specify
geometric properties we could use variables
• line_start_x,
• line_start_y,
• line_end_x,
• line_end_y,
• circle_x,
• circle_y,
• circle_radius.
In addition, we could have functions
• draw_line,
• draw_circle,
• move_line,
• move_circle,
• rotate_line,
and so on.
Utilizing the idea of objects we could introduce objects of type Point with member variables x and y as well as an
object of type Line with member variables start and end both of type Point. Analogously, an object of type
Circle with member variables center (of type Point) and radius would be nice. Both the Line object and
the Circle object could have member functions draw and move.
The object hierarchy could look like this:
• my_line
– start
∗ x
∗ y
– end
∗ x
∗ y
– draw()
– move(...)
– rotate(...)
• my_circle
– center
∗ x
∗ y
– radius
– draw()
– move(...)
Objects Everywhere
Surprisingly, in Python there are no values which are not objects. Even integers are objects. Most other object-
oriented programming languages have some fundamental data types (integers, floats, characters) that are not
represented as objects in the sense of object-oriented programming. In Python there are no such fundamental types!
To get a list of all members (variables and functions) of an object, Python has the built-in function dir.
a = 2
dir(a)
['__abs__',
'__add__',
'__and__',
'__bool__',
'__ceil__',
'__class__',
'__delattr__',
'__dir__',
'__divmod__',
'__doc__',
'__eq__',
'__float__',
'__floor__',
'__floordiv__',
'__format__',
'__ge__',
'__getattribute__',
'__getnewargs__',
'__gt__',
'__hash__',
'__index__',
'__init__',
'__init_subclass__',
'__int__',
'__invert__',
...]
Note: The dir function returns a list of strings. Since it’s the last line of a cell, this list is printed to screen by
JupyterLab automatically.
Most of all these members won’t be used directly. The __add__ function, for instance, is called by the Python
interpreter whenever we add integers. The following line of code is equivalent to a + 3:
a.__add__(3)
Each object has a type. Objects of identical type provide identical functionality. An object’s type is returned by
Python’s built-in function type.
type(a)
int
Note: As for dir above, the type function does not print anything. Screen output is done by JupyterLab
automatically here. Note that Python’s print yields <class 'int'> here whereas JupyterLab’s display produces
int to visualize type’s return value.
Note that the dot syntax object.member already appeared when accessing modules: module.function. This
is not by chance. Everything is an object in Python!
import numpy as np
type(np)
module
Importing a module leads to a new object of type module providing all the functionality of the imported module in
form of member functions and variables.
import numpy as np
dir(np)
['ALLOW_THREADS',
'AxisError',
'BUFSIZE',
'CLIP',
'ComplexWarning',
'DataSource',
'ERR_CALL',
'ERR_DEFAULT',
'ERR_IGNORE',
'ERR_LOG',
'ERR_PRINT',
'ERR_RAISE',
'ERR_WARN',
'FLOATING_POINT_SUPPORT',
'FPE_DIVIDEBYZERO',
'FPE_INVALID',
'FPE_OVERFLOW',
'FPE_UNDERFLOW',
'False_',
'Inf',
'Infinity',
...]
def my_function():
    print('Here we could do something useful.')
type(my_function)
function
type(type(my_function))
type
type(print)
builtin_function_or_method
We may define new objects with the class keyword. Instead of ‘object type’ one often says ‘class’. To create a class
for describing geometric points we could write
class Point:

    def __init__(self, x, y):
        self.x = x
        self.y = y
This defines a new object type or class Point. Here __init__ is the only member function. The function with
this special name is implicitly called by Python whenever a new Point object has been created. This initialization
function expects the coordinates of the new point and stores them internally as member variables. The self argument
provides access to the newly created object.
A class is like a blueprint. But the methods defined in the blueprint are called for concrete objects. The self
argument gives access to the object for which a method has been called. Remember: there is only one class (blueprint),
but there may be many objects of this type.
Note: From a technical point of view instead of self we may use any name for the first argument (bad practice!).
But the first argument of a method always is the object for which the method has been called.
Note: Methods with two leading and two trailing underscores are called magic methods or dunder methods (dunder
= double underscore). They are called by the Python interpreter for special purposes. We already met another
dunder method: __add__, which is called whenever the + operator is used on an object. In contrast to most other
programming languages virtually every operation in Python can be customized by implementing a suitable dunder
method.
If we write
center = Point(3, 7)
the Python interpreter creates a new object of type Point. At this moment the new object has only one method,
__init__ and no member variables. Immediately after creation Python calls the object’s __init__ function.
The arguments passed to __init__ are the newly created object, 3 and 7.
Now we can work with our object as expected.
print(center.x)
center.x = center.x + 2
print(center.x)
3
5
type(center)
__main__.Point
dir(center)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'x',
'y']
The reason for __main__ in the object type will be discussed later on. From the member list we see that there are
many automatically created members. Some of them will be discussed later on, too.
7 Variables
Variables in Python follow a different, much more elegant approach than in other programming languages. In this
chapter we discuss the details of variables in Python and some consequences of Python’s approach to variables.
• Names and Objects (page 75)
• Types (page 80)
• Operators (page 85)
• Efficiency (page 89)
Related exercises:
• Variables and Operators (page 253)
• Memory Management (page 256)
Related project: Vector Multiplication (page 301).
In contrast to most other programming languages Python follows a very clean and simple approach to memory access
and management. We now dive deeper into Python’s internal workings to come up with a new understanding of what
we called variables in the Crash Course (page 43).
Most programming languages, C for instance, assign fixed names to memory locations. Such combinations of memory
location and name are known as variables. Assigning a value to a variable then means that the compiler or interpreter
writes the value to the memory location to which the variable’s name belongs. There is a one-to-one correspondence
between variable names and memory locations.
Consider the following C code:
int a;
int b;
a = 5;
b = a;
The first two lines tell the C compiler to reserve memory for two integer variables. The third line writes the value 5
to the location named a. The fourth line reads the value at the location named a and writes (copies) it to the location
named b.
Data Science and Artificial Intelligence for Undergraduates
Fig. 7.1: Memory is organized as a linear sequence of bytes. Used and currently unused bytes are managed by the
operating system and by the compiler. In C programs there is a one-to-one correspondence between variable names and
memory locations.
Python allows for multiple names per memory location and adds a layer of abstraction.
In Python everything is an object and objects are stored somewhere in memory. If we use integers in Python, then
the integer value is not written directly to memory. Instead, additional information is added and the resulting more
complex data structure is written to memory.
A newly created Python object does not have a name. Instead, Python internally assigns a unique number to each
object, the object identifier or object ID for short. Thus, there is a one-to-one correspondence between object IDs and
memory locations.
In addition to a list of all object IDs (and corresponding memory locations), Python maintains a list of names occurring
in the source code. Each name refers to exactly one object. But different names may refer to the same object. In this
sense Python does not have variables as described above, but only objects and names tied to objects.
Consider the following code:
a = 5
b = a
The first line creates an integer object containing the value 5 and then ties the name a to this object. The second line
takes the object referenced by the name a and ties a second name b to it.
Fig. 7.2: In Python one memory location may have several names, but a unique object ID.
Important: The assignment operation = in Python is not about writing something to memory. Instead, Python takes
the existing object on the right-hand side of = and ties an additional name to it.
The object on the right-hand side may have existed before or it may be created by some operation specified by the
code following =.
It’s also possible to create nameless objects. Simply omit name = before some object creation code.
print(id(a))
print(id(b))
139672564187504
139672564187504
In Python we have objects and we have values contained in the objects. Thus, there are two fundamentally different
questions which might be relevant for controlling program flow:
• Do two names refer to the same object?
• Do two objects (referred to by two names) contain the same value?
Consider the following code:
a = 1.23
b = 1.23
It creates two float objects both holding the value 1.23. To see that there are two objects we can look at the object
IDs:
print(id(a))
print(id(b))
139672516871888
139672519436656
So the answer to the first question is ‘no’, but the answer to the second question is ‘yes’.
To compare identity of objects Python provides the is operator. To compare equality of values Python has the ==
operator. Both yield a boolean value as result.
print(a is b)
print(a == b)
False
True
Negations of both operators are is not and !=, respectively. Using is is equivalent to comparing object IDs:
print(id(a) == id(b))
False
Hint: The behavior of the is operator is hardwired in Python (it amounts to using == on the integer objects returned
by id). But == simply calls the dunder method __eq__ of the left-hand side object. Thus, what happens during
comparison depends on an object’s type. When writing your own classes (object types) you may implement the
__eq__ method whenever appropriate. Without a custom implementation Python uses a default one behaving
similarly to is.
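As a sketch, a custom __eq__ could make == compare the coordinates of point objects (the Point class and its members are made up for illustration):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # called whenever == is used on a Point object
        return self.x == other.x and self.y == other.y

a = Point(1, 2)
b = Point(1, 2)
print(a is b)   # False: two distinct objects
print(a == b)   # True: our __eq__ compares values
```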
Names in Python have a scope, that is, a region of code where they are valid. Names defined outside functions and
other structures are referred to as global names or global variables or simply globals. If a name is defined (that is, tied
to some object) inside a function or some other structure, then the name is local. Local names are undefined outside
the function or structure they are defined in.
def my_func():
    print(c)
    d = 456
c = 123
my_func()
print(d)
123
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [7], in <cell line: 7>()
5 c = 123
6 my_func()
----> 7 print(d)
If there is a local name which is also a global name, then its local version is used and the global one is left untouched.
def my_func():
    c = 456
    print(c)
c = 123
my_func()
print(c)
456
123
But how to change a global variable from inside a function? The global keyword tells the interpreter that a name
appearing in a function refers to a global variable. The interpreter then uses the global variable instead of creating a
new local variable.
def my_func():
    global c
    c = 456
    print(c)
c = 123
my_func()
print(c)
456
456
We cannot access a global variable from inside a function and then introduce a local variable with the same name.
This leads to an error because each name appearing in an assignment in a function is considered local throughout
the function. Consequently, accessing the value of a global variable before creating a corresponding local variable is
interpreted as accessing an undefined name. The interpreter then complains about accessing a local variable before
assignment.
def my_func():
    print(c)
    c = 456
c = 123
my_func()
print(c)
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Input In [10], in <cell line: 6>()
3 c = 456
5 c = 123
----> 6 my_func()
7 print(c)
Important: It’s considered bad practice to use lots of global variables. Global variables result in low readability of
code. Exceptions prove the rule.
7.2 Types
Here we introduce two more very fundamental object types, discuss type conversion and introduce the Python-specific
concept of immutability.
In the Crash Course (page 43) we met several data or object types (classes):
• integers
• floats
• booleans
• lists
We also met strings, which will be discussed in more detail later on. Python ships with some more data types. Next
to complex numbers, which we do not consider here, and several list-like types Python knows two very special data
types:
• None type
• NotImplemented type
Both types can represent only one value: None and NotImplemented, respectively. Thus, they can be considered
as constants. But since in Python everything is an object, constants are objects, too. Objects have a type (class). Thus,
there is a None type and a NotImplemented type.
print(type(None))
print(type(NotImplemented))
<class 'NoneType'>
<class 'NotImplementedType'>
Existence of NotImplemented will be justified soon. Typically it’s used as return value of functions to signal the
caller that some expected functionality is not available.
The value None is used whenever we want to express that a name is not tied to a useful object. In that case we simply tie
the name to the None object. We write ‘the’ because the Python interpreter creates only one object of None type.
Such an object always holds one and the same value, so there is no reason to create several different None objects.
a = None
b = None
print(id(a))
print(id(b))
94287211210560
94287211210560
a = 'Some string'
b = 'Some string'
print(id(a))
print(id(b))
139871271436464
139871271437360
In the second code block two string objects are created although both hold the same value. For both None values in
the first code block only one object is created. If you play around with this you may find that for short strings and
small integers Python behaves as for None. This issue will be discussed in detail soon.
Hint: None is a Python keyword like if or else or import. It is used to refer to the object of None type. But
the memory occupied by this object does not necessarily contain a string ‘None’ or something similar. In fact, this
object does not contain anything useful. Its value lies in its mere existence, which allows us to tie (temporarily unused)
names to it. The same holds for NotImplemented.
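One common way None shows up: a function without an explicit return statement returns None (a minimal sketch, the function name is made up):

```python
def do_nothing():
    pass

result = do_nothing()    # the function returns None implicitly
print(result is None)    # True
```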
We already met this concept when introducing boolean values. True and False are Python keywords, too. They
are used to refer to two different objects of type bool. But these objects do not contain a string ‘True’ or ‘False’
or something similar. Instead, a bool object stores an integer value: 1 for the True object and 0 for the False
object. How to represent None, True and so on in memory depends on the concrete implementation of the Python
interpreter and is not specified by the Python programming language.
Type casting means changing the data type. An integer could be cast to a floating point number, for example. Python
does not have a dedicated mechanism for type casting. Instead, dunder methods can be implemented to work with
objects of different types.
A very prominent dunder method for handling different object types is the __init__ method, which is called after
creating a new object. Its main purpose is to fill the new object with data. For Python standard types like int,
float, bool the __init__ method accepts several different data types as argument.
We’ve already applied the function int, which creates an int object, to strings. Thus, we have seen that the
__init__ method of int objects accepts strings as argument and tries to convert them to an integer value. The
other way round, str for creating string objects accepts integer arguments.
a = '123' # a string
b = int(a)
print(type(b))
print(b)
<class 'int'>
123
a = 123 # an integer
b = str(a)
print(type(b))
print(b)
<class 'str'>
123
a = 2 # an integer
b = float(a)
print(type(b))
print(b)
<class 'float'>
2.0
Data may get lost due to type casting. The Python interpreter will not complain about possible data loss.
a = 2.34 # a float
b = int(a)
print(type(b))
print(b)
<class 'int'>
2
Hint: It’s good coding style to use explicit type casting instead of relying on implicit conversions whenever this
increases readability.
A counter example is 1.23 * 56, where the integer 56 is converted to float implicitly. Explicit casting would
decrease readability: 1.23 * float(56).
Note: If you define a custom object type, it depends on your implementation of the type’s __init__ method what
data types can be cast to your type.
Casting to bool maps 0, empty strings and similar values to False, all other values to True.
print(bool(None))
print(bool(0))
print(bool(123))
print(bool(''))
print(bool('hello'))
False
False
True
False
True
If we use non-boolean values where booleans are expected, Python implicitly casts to bool:
if not '':
    print('cumbersome condition satisfied')
For historical reasons boolean values internally are integers (0 or 1). This sometimes yields unexpected (but
well-defined) results. An example is the comparison of integers to True.
a = 3

if a:
    print('first if')
else:
    print('first else')

if a == True:
    print('second if')
else:
    print('second else')
first if
second else
The first condition is equivalent to bool(3), which yields True, whereas the second is equivalent to 3 == 1,
yielding False. See PEP 28559 for some discussion of that behavior (PEP 285 introduced bool to Python).
7.2.4 Immutability
Objects in Python can be either mutable or immutable. Mutable objects allow modifying the value they hold.
Immutable objects do not allow changing their values. Objects of simple types like int, float, bool, str are
immutable whereas lists and most other types are mutable.
Understanding the concept of (im)mutability is fundamental for Python programming. Even if the source code
suggests that an immutable object gets modified, in fact a new object is created every time:
a = 1
print(id(a))
a = a + 1
print(id(a))
139871335317744
139871335317776
This code snippet first creates an integer object holding the value 1 and then ties the name a to it. In line 3, sloppily speaking, a is increased by one. More precisely, a new integer object is created, holding the result 2 of the
computation, and the name a is tied to this new object.
Mutable objects behave as expected:
a = [1, 2, 3]
print(id(a))
a[0] = 4
print(id(a))
139871271501888
139871271501888
Immutability of some data types allows the Python interpreter to operate more efficiently and to optimize code
during execution. We will discuss some of those efficiency-related features later on.
Always be aware of (im)mutability of your data. The following two code samples show fundamentally different
behavior:
a = 1 # immutable integer
b = a
a = a + 1
print(a, b)
59 https://peps.python.org/pep-0285/#resolved-issues
2 1
Increasing a does not touch b, because the integer object a and b refer to is immutable. Increasing a creates a new
object. Then a is tied to the new object and b still refers to the original one.
a = [1, 2, 3]  # mutable list
b = a
a[0] = 4
print(a, b)
[4, 2, 3] [4, 2, 3]
Modifying a also modifies b, because a and b refer to the same mutable object.
Although rarely needed, we mention the built-in function isinstance. It takes an object and a type as parameters
and returns True if the object is of the given type.
print(isinstance(8, int))
print(isinstance(8, str))
print(isinstance(8.0, float))
True
False
True
There are a bunch of dunder functions one should implement when creating custom types (cf. Custom Object Types
(page 73)):
• __str__ is called by the Python interpreter to get a text representation of an object. For instance, it’s called
by print and whenever one tries to convert an object to string via str(...).
• __repr__ is similar to __str__ but should return a more informative string representation. In the best case,
it returns the Python code to recreate the object. See Python’s documentation⁶⁰ for details.
• __bool__ is called whenever an object has to be cast to bool.
• __len__ is called by the built-in function len to determine an object’s length. This is useful for list-like
objects.
60 https://docs.python.org/3/reference/datamodel.html#object.__repr__
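A minimal sketch tying these four methods together (the Playlist type is invented for illustration):

```python
class Playlist:
    def __init__(self, songs):
        self.songs = list(songs)

    def __str__(self):
        # used by print() and str()
        return 'playlist with {} songs'.format(len(self.songs))

    def __repr__(self):
        # in the best case: Python code that recreates the object
        return 'Playlist({!r})'.format(self.songs)

    def __bool__(self):
        # empty playlists behave like False
        return len(self.songs) > 0

    def __len__(self):
        # used by the built-in len()
        return len(self.songs)

p = Playlist(['song A', 'song B'])
print(p)                   # playlist with 2 songs
print(repr(p))             # Playlist(['song A', 'song B'])
print(bool(Playlist([])))  # False
print(len(p))              # 2
```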
Since everything in Python is an object, types are objects, too. Thus, types may provide member variables and
methods in addition to the corresponding objects’ member variables and methods. In some programming languages
members of a type are called static members.
Member variables of types occur for instance if constants have to be defined (almost always for convenience):
class ColorPair:
    red = (1, 0, 0)
    green = (0, 1, 0)
    blue = (0, 0, 1)
    yellow = (1, 1, 0)
    cyan = (0, 1, 1)
    magenta = (1, 0, 1)
Member functions of types are rarely used. One use case is very flexible constructors for complex types, which do
not fit into the __init__ method due to many different variants of possible arguments. Often such constructors
are named from_... and corresponding object creation code looks like
obj = SomeType.from_other_type(other_object)
In such cases the from_... methods return a new object of the corresponding type, that is, they implicitly call the
__init__ method.
Defining methods for types requires advanced syntax constructs we do not discuss here.
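For the curious, the usual construct for such alternative constructors is Python's @classmethod decorator; the Point type here is a made-up example, not taken from the book:

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    @classmethod
    def from_tuple(cls, xy):
        # alternative constructor: builds the object via cls(...),
        # which implicitly calls __init__
        return cls(xy[0], xy[1])

p = Point.from_tuple((3, 4))
print(p.x, p.y)  # 3 4
```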
7.3 Operators
Like most other programming languages Python offers lots of operators to form new data from existing data. Important
classes of operators are
• arithmetic operators (+, -, *, /,…),
• comparison operators (==, !=, <, >,…),
• logical operators (and, or, not,…).
Expressions containing more than one operator are evaluated in a well-defined order⁶¹. Python’s operators can be listed
from highest to lowest priority. Operators with identical priority are evaluated from left to right.
Syntax Operator
** exponentiation
+, - (unary) sign
*, /, //, % multiplication, division, floor division, modulo
+, - (binary) addition, subtraction
==, !=, <, >, <=, >= comparison
not logical not
and logical and
or logical or
We may write chained comparisons like a < b < c. Python interprets them as single comparisons connected by
and, that is, a < b and b < c.
Note: Unfortunate expressions like a < b > c are allowed, too. This example is equivalent to a < b and b > c. In a chain only neighboring operands are compared to each other! There is no comparison between a and c
here.
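A short demonstration of this pitfall:

```python
a, b, c = 1, 5, 0

print(a < b > c)        # True: parsed as (a < b) and (b > c)
print(a < b and b > c)  # True: the equivalent explicit form
print(a < c)            # False: a and c are never compared in the chain
```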
For arithmetic binary operators there’s a shortcut for expressions like a = a + b. We may write them as a += b. The latter is called an augmented assignment. Although the result will look the same, there are two technical
differences one should know:
• augmented assignment may work in-place,
• for augmented assignment the assignment target will be evaluated only once.
In-place Computations
For usual binary operators the Python interpreter calls corresponding dunder methods, like __add__ for +. Aug-
mented assignments have their own dunder methods starting with i. So += calls __iadd__, for instance. The
intention is that += may work in-place, that is, without creating a new object. Of course, this is only possible for
mutable objects. If there is no dunder method for augmented assignment the interpreter falls back to the usual binary
operator’s dunder method.
Example: The + operator applied to two lists concatenates both lists. With a = a + b a new list object holding
the result is created and then the name a is tied to the new object. With a += b list b is appended to the existing
list object referred to by a.
a = [1, 2, 3]
b = [4, 5, 6]
print(id(a))
a = a + b
print(a)
print(id(a))
61 https://docs.python.org/3/reference/expressions.html#operator-precedence
139941989941376
[1, 2, 3, 4, 5, 6]
139941989942272
a = [1, 2, 3]
b = [4, 5, 6]
print(id(a))
a += b
print(a)
print(id(a))
139941989933632
[1, 2, 3, 4, 5, 6]
139941989933632
If the assignment target is a more complex expression like for list items, the expression will be evaluated twice with
usual binary operators, but only once if augmented assignment is used.
Example: In the following code an item of a list shall be incremented by 1. The item’s index is computed by some
complex function get_index (which for demonstration purposes is very simple here). The two code cells show
different implementations, resulting in a different number of calls to get_index.
def get_index():
    print('get_index called')
    return 2

a = [1, 2, 3, 4]
a[get_index()] = a[get_index()] + 1
print(a)
get_index called
get_index called
[1, 2, 4, 4]
def get_index():
    print('get_index called')
    return 2

a = [1, 2, 3, 4]
a[get_index()] += 1
print(a)
get_index called
[1, 2, 4, 4]
Note: If efficiency matters you should prefer augmented assignments. Even a += b with integers a and b is more
efficient than a = a + b, because the name a will be looked up only once in the table mapping names to object
IDs.
All Python operators, == and + for instance, simply call a specially named member function of the involved objects,
a so-called dunder method. Lines 2 and 3 of the following code cell do exactly the same thing:
a = 5
b = a + 2
c = a.__add__(2)
print(b)
print(c)
7
7
Dunder methods allow creating new object types which implement all the Python operators themselves. What
an operator does depends on the operands’ object type. For instance, + applied to numbers is usual addition, but +
applied to strings is concatenation.
For binary operators like + and == there is always the question which of both objects to use for calling the corresponding dunder method. In case of comparisons Python uses the dunder method of the left-hand side operand (up to
one minor exception which we don’t discuss here). For arithmetic operations Python always tries the left operand first
(again, we omit a minor exception). If it does not have the required dunder method, then Python tries the operand
on the right-hand side. If both objects lack the dunder method, then the interpreter stops with an error.
Binary arithmetic operations might be asymmetric. Thus, there are two variants of most arithmetic dunder methods:
one for applying an operation as the left-hand side operand and one for applying an operation as the right-hand side
operand. For addition the methods are called __add__ and __radd__, for multiplication we have __mul__ and
__rmul__. Others follow the same scheme.
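As a sketch with an invented Meters type, a class can support both operand positions like this:

```python
class Meters:
    """Made-up length type supporting addition with plain numbers."""
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # handles Meters(2) + 3
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented

    def __radd__(self, other):
        # handles 3 + Meters(2): called after int's __add__ gives up
        return self.__add__(other)

print((Meters(2) + 3).value)  # 5
print((3 + Meters(2)).value)  # 5
```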
Often binary operators shall be applied to objects of different types (adding integer and floating point values, for
instance). Even if both objects have the corresponding dunder method, one or both of them could lack code for
handling certain object types.
In such a case Python calls the dunder method and the method returns NotImplemented to signal that it doesn’t
know how to handle the other operand. Then the interpreter tries the dunder method of the other operand. If it
returns NotImplemented, too, then the interpreter stops with an error. The __add__ function of integer objects
cannot handle float objects, but __add__ of float objects can handle integers, for example:
a = 2
b = 1.23
print(a.__add__(b))
print(b.__add__(a))
NotImplemented
3.23
Writing a + b in the example above first calls a.__add__(b), which returns NotImplemented, then
b.__radd__(a). With b + a the first call goes to b.__add__(a) and no second call (to a.__radd__(b)) is required.
An example of beneficial use of Python’s flexible mechanism for customizing operators is discussed in the project on
Vector Multiplication (page 301).
7.4 Efficiency
Python’s combination of the names/objects concept and (im)mutability tends to waste memory and CPU time:
• There might be many different objects all holding the same value. In principle, every time the number 1 occurs
in the source code, a new integer object is created.
• Tying names to other objects may leave objects without a name. Such objects are no longer accessible but remain
in memory.
• Modifying immutable objects requires creating new objects. Thus, even simple integer computations require
relatively complex memory management operations.
To mitigate these drawbacks, the Python interpreter uses several optimization strategies. Although such issues are
rather technical we briefly discuss them here, because they sometimes yield unexpected results.
To avoid object creation every time a new integer is used, the Python interpreter pre-creates integer objects for all
integers from -5 to 256. This saves CPU time. The somewhat cumbersome range stems from statistical considerations
about integer usage.
In addition, the interpreter takes care that no integer in this range is created twice during program execution. This
saves memory. The behavior is demonstrated in the following code snippet:
a = 8
b = 4 + 4
print(id(a))
print(id(b))
140115940688336
140115940688336
Both object IDs are identical, thus only one integer object is used. Since integer objects are immutable, this cannot
cause any trouble.
As for integers, the Python interpreter tries to avoid multiple string objects with the same value. Since the
comparisons involved may require too much CPU time, this technique, known as string interning, is only used for
short strings. The rules controlling which strings get interned and which don’t are relatively complex.
a = 'short'
b = 'sh' + 'ort'
print(id(a))
print(id(b))
140115859848048
140115859848048
prefix = 'this string is too long'
a = prefix + ' to be interned'
b = prefix + ' to be interned'
print(id(a))
print(id(b))

140115859848944
140115899208176
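Interning can also be requested explicitly with the standard library function sys.intern, which always returns the single shared object for a given string value:

```python
import sys

# without intern, strings containing spaces are usually not interned
b = sys.intern('a string with spaces, interned by hand')
c = sys.intern('a string with spaces, interned by hand')
print(b is c)  # True: both names refer to the one interned object
```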
Before executing a Python program, the interpreter checks the syntax and creates a list of all literals. Here, literals
are all types of explicit data appearing in the source code, like integers or strings. If some literal appears multiple
times and if objects of the corresponding data type are immutable, only one object is created.
# Copy the following Python code to a text file and feed the file to the
# Python interpreter to see the effect of optimization of repeated literals.
a = 'a rather long string literal'
b = 1000
c = 'a rather long string literal'
d = 1000
print(id(a))
print(id(b))
print(id(c))
print(id(d))
The names a and c will point to the same string object, although the string is too long to be interned by the string
interning mechanism. The names b and d will point to the same integer object, although they are outside the range
of preloaded integers.
Care has to be taken when using interactive Python interpreters like Jupyter. If the above code snippet is executed
line by line in an interactive interpreter, then four different objects will be created, because the interpreter does not
parse the full code in advance.
Important: Executing Python code with an interactive interpreter may yield different results than executing the
same code at once with a non-interactive interpreter! In particular, performance measures like memory consumption
may differ.
print(id(a))
print(id(b))
print(id(c))
print(id(d))
140115894986160
140115859673552
140115894986352
140115859679312
As described above, there might be objects without names. Such objects remain in memory, but are no longer accessible.
To avoid filling up memory as time passes, the Python interpreter automatically removes nameless objects from
memory. This mechanism is known as garbage collection and is a feature not available in all programming languages.
In the C programming language, for instance, the programmer has to take care to free memory if data isn’t needed
anymore.
Sometimes, especially when working with large data sets, one wants to get rid of some data in memory to have more
memory available for other purposes. One way is to tie all names referring to the no longer needed object to other
objects, which is somewhat unintuitive. Alternatively, the del keyword can be used to untie a name from an object.
a = 5000
del a
print(a)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [5], in <cell line: 3>()
1 a = 5000
2 del a
----> 3 print(a)
The last line leads to an error message, because after executing line 2 the name a is no longer valid. Note that del
only deletes the name, not the object. In the following code snippet the object remains in memory, because it has
another name:
a = 5000
b = a
del a
print(b)
5000
EIGHT
LISTS AND FRIENDS
We already met lists in the Crash Course (page 43). Now it’s time to discuss some details and to introduce further
list-like object types as well as Python features for efficiently working with lists and friends.
• Tuples (page 93)
• Lists (page 95)
• Dictionaries (page 98)
• Iterable Objects (page 98)
Related exercises: Lists and Friends (page 259).
8.1 Tuples
Tuples are immutable objects which hold a fixed-size list of other objects. Imagine tuples as a list of pointers to
objects, where the number of pointers and the pointers themselves cannot be changed. But note that the objects the
tuple entries point to can change if they are mutable. The object type for tuples is tuple.
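A short demonstration of this subtlety:

```python
inner = [1, 2]      # a mutable list
t = ('a', inner)    # the tuple itself is immutable...
inner.append(3)     # ...but a mutable item it points to can change
print(t)            # ('a', [1, 2, 3])
```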
8.1.1 Syntax
Tuples are defined as a comma separated list. Often the list is surrounded by parentheses:
colors = ('red', 'green', 'blue', 'yellow')
Tuples with only one item are allowed, too. To distinguish them from single objects, a trailing comma is required.
a = 1 # integer
b = (1) # integer
c = 1, # tuple containing one integer
d = (1,) # tuple containing one integer
8.1.2 Indexing
Tuple items can be accessed by index. The first item has index 0, the second index 1, and so on. The length of a tuple
is returned by the built-in function len. Indexing with negative numbers gives the items in reverse order.
print(colors[2])
print(colors[-1])
print(len(colors))
print(colors)
blue
yellow
4
('red', 'green', 'blue', 'yellow')
Subtuples can be extracted by so-called slicing. Simply provide a range of indices: [2:5] gives a new tuple consisting
of the items with indices 2, 3, 4. The new tuple is a new object and remains available even if the original tuple vanishes (cf. Garbage
Collection (page 91)).
some_colors = colors[1:3]
print(some_colors[0])
print(some_colors)
green
('green', 'blue')
Extracting every second item can be done by [3:10:2], which gives items with indices 3, 5, 7, 9. The general
syntax is [first_index:last_index_plus_one:step]. Here are some more indexing examples:
print(colors[2:4])
print(colors[2:])
print(colors[2:3])
print(colors[:3])
print(colors[:])
('blue', 'yellow')
('blue', 'yellow')
('blue',)
('red', 'green', 'blue')
('red', 'green', 'blue', 'yellow')
Tuples can be used for returning more than one value from a function. For this purpose Python provides the following
code construct:
a, b, c = 23, 42, 6
print(a, b, c)
23 42 6
That is, we can assign the contents of a tuple to a tuple of names. Such constructions typically are used if a function
returns several values packed into a tuple.
def quotient_and_remainder(a, b):
    c = a // b
    d = a % b
    return c, d

quotient, remainder = quotient_and_remainder(79, 10)
print(quotient)
print(remainder)
7
9
Calling the built-in function tuple creates a tuple from a list:
some_list = [1, 2, 3, 4, 5]
some_tuple = tuple(some_list)
Note that both list and tuple point to the same items. Items aren’t copied! Modifying (mutable) list items will modify
tuple items, too. See Multiple Names and Copies (page 96) for more details.
8.2 Lists
Lists are mutable objects which hold a flexible number of other objects. Lists can be regarded as mutable tuples.
Indexing syntax is the same, including slicing. The type name for lists is list. Calling list(...) creates a list from
several other types (tuples, for instance).
In principle, Python lists can hold different object types. But it is considered bad practice to use this feature. It’s
better to have lists made up of objects of the same type.
The append method adds an item at the end of a list:
a = [] # empty list
b = [2, 4, 6] # list with three integer items
b.append(8) # now the list has four items
print(len(b)) # length of list

4
The extend method appends all items of another list:
a = [1, 2, 3, 4]
a.extend([9, 8, 7])
print(a)

[1, 2, 3, 4, 9, 8, 7]
The sort method sorts a list in place:
a = [3, 2, 5, 4, 1]
a.sort()
print(a)

[1, 2, 3, 4, 5]
To search a list use the index method. It returns the index of the first occurrence of its argument in the list.
a = [3, 2, 5, 4, 1]
print(a.index(5))

2
To remove a list item either use the del keyword (remove by index) or the remove method (remove by value):
a = [1, 2, 3, 4]
del a[1]
print(a)
a.remove(3)
print(a)
[1, 3, 4]
[1, 4]
Multiple Names and Copies
a = [1, 2, 3, 4]
b = a
b[0] = 9
print(a)
print(b)
[9, 2, 3, 4]
[9, 2, 3, 4]
In this code snippet a list is created and the name a is tied to it. Then this list object gets b as a second name.
Important: We have one (!) list object with two names, not two lists! Thus, modifying b also modifies a.
To copy a list use the copy method:
a = [1, 2, 3, 4]
b = a.copy()
b[0] = 9
print(a)
print(b)
[1, 2, 3, 4]
[9, 2, 3, 4]
This creates a so-called shallow copy. A new list object is created and the items of the new list point to exactly the
same objects as the original list. If the original list consists of immutable objects (like integers), then modifying the
copy will not alter the original. But if list items point to mutable objects, then altering those objects through the copy will
modify the original list, too. Copying a list including all objects the list points to is known as deep copying. How to
automatically deep-copy a list will be discussed later on.
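The pitfall can be demonstrated with a list of lists:

```python
a = [[1, 2], [3, 4]]  # outer list holds mutable inner lists
b = a.copy()          # shallow copy: inner lists are shared
b[0].append(9)        # modifies the inner list both a and b point to
print(a)              # [[1, 2, 9], [3, 4]]
print(b)              # [[1, 2, 9], [3, 4]]
```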
Fig. 8.1: Python lists are lists of memory locations (object IDs). Shallow copying only copies the list of memory
locations. Deep copying also copies the data at those memory locations.
8.3 Dictionaries
Dictionaries are like lists, but indices (here denoted as keys) are not restricted to integers but can be of any immutable
type. Even tuples are allowed as keys if they do not contain mutable items. Data types for keys can be mixed. Type
name for dictionaries is dict.
person = {'name': 'John', 'age': 42}
person['age'] += 1
print(person['name'])
print(person['age'])

John
43

New keys can be added by simply assigning a value:
person['gender'] = 'male'
8.4 Iterable Objects
Note: Python follows the duck typing⁶² approach: If it looks like a duck, walks like a duck, swims like a duck, and
quacks like a duck, then it probably is a duck.
In other words: There are many object types in Python which behave like a list, but aren’t of type list. In particular,
len and indexing syntax [...] may be used for such objects.
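For instance, strings and range objects are not lists, yet they quack like lists:

```python
s = 'hello'    # a str, not a list
print(len(s))  # 5
print(s[1])    # e

r = range(3)   # a range object, not a list
print(len(r))  # 3
print(r[2])    # 2
```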
Tuples, lists, and dictionaries are examples of iterable objects. These are objects which allow for consecutive evaluation
of their items. There exist more types of iterable objects in Python and we may define new iterable objects by
implementing suitable dunder methods.
62 https://en.wikipedia.org/wiki/Duck_typing
8.4.1 For Loops
We briefly mentioned for loops in the Crash Course (page 43). Now we add the details.
Basic Iteration
For loops allow iterating through iterable objects of any kind, especially tuples, lists, and dictionaries.
colors = ('red', 'green', 'blue', 'yellow')
for color in colors:
    print(color)

red
green
blue
yellow
The indented code below the for keyword is executed as long as there are remaining items in the iterable object. In
the example above the name color first points to colors[0], then, in the second run, to colors[1], and so
on.
This index-free iteration is considered pythonic: it’s much more readable than indexing syntax and directly fulfills the
purpose of iteration, while indexing in most cases is useless additional effort.
Note: The range built-in function we saw in the crash course is not part of the for loop’s syntax. Instead, it’s a
built-in function returning an iterable object which contains a series of integers as specified in the arguments passed
to the range function. The returned object is of type range.
The range function takes up to three arguments: the starting index, the stopping index plus 1, and the step size.
Step size defaults to 1 if omitted. If only one argument is provided, it’s interpreted as stopping index (plus 1) and
start index is set to 0.
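Wrapping range results in list makes the generated integers visible:

```python
print(list(range(4)))        # [0, 1, 2, 3]: only the stop value given
print(list(range(2, 8)))     # [2, 3, 4, 5, 6, 7]: start and stop
print(list(range(2, 8, 2)))  # [2, 4, 6]: start, stop, and step
```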
If, in addition to the items themselves, their index is needed inside the loop, use the built-in function enumerate and
pass the iterable object to it. The enumerate function will return an iterable object which yields a 2-tuple in each
iteration. The first item of each tuple is the index, the second one is the corresponding object.
for index, color in enumerate(colors):
    print(str(index) + ': ' + color)
0: red
1: green
2: blue
3: yellow
Being new to Python one is tempted to use indexing for iteration over multiple lists in parallel:
names = ['John', 'Max', 'Lisa']
surnames = ['Doe', 'Muller', 'Lang']

# non-pythonic!
for i in range(len(names)):
    print(names[i] + ' ' + surnames[i])
John Doe
Max Muller
Lisa Lang
For iterating over multiple lists at the same time pass them to zip. This built-in function returns an iterable object
which yields tuples. The first tuple consists of the first elements of all lists, the second of the second elements of all
lists, and so on. The returned iterable object has as many items as the shortest list.
for name, surname in zip(names, surnames):
    print(name + ' ' + surname)
John Doe
Max Muller
Lisa Lang
8.4.2 Comprehensions
Applying some operation to each item of an iterable object can be done via for loops. But there are handy short-hands
known as list comprehensions and dictionary comprehensions.
List Comprehensions
General syntax is
new_list = [expression for item in iterable]
The following code snippet generates a list of squares from a list of numbers.
some_numbers = [2, 4, 6, 8]
squares = [n ** 2 for n in some_numbers]
print(squares)

[4, 16, 36, 64]
Two lists can be combined elementwise by zipping them inside the comprehension:
some_numbers = [2, 4, 6, 8]
more_numbers = [1, 2, 3, 4]
products = [a * b for a, b in zip(some_numbers, more_numbers)]
print(products)

[2, 8, 18, 32]
Same principles (zipping and nesting) work for more than two lists, too.
Dictionary Comprehensions
print(stars)
print(stars)
Note: Dictionary comprehensions may be used to create new dictionaries from arbitrary iterable objects. For instance
we could loop over the zipped lists, one holding the keys and one holding the values.
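A sketch of that idea (the names and values are made up):

```python
names = ['John', 'Max', 'Lisa']
ages = [42, 23, 31]

# dictionary comprehension over two zipped lists:
# keys from the first list, values from the second
people = {name: age for name, age in zip(names, ages)}
print(people)  # {'John': 42, 'Max': 23, 'Lisa': 31}
```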
Conditional Comprehensions
List and dictionary comprehensions can be extended by a condition: simply append if some_condition to the
comprehension.
some_numbers = [2, 4, 6, 8]
more_numbers = [1, 0, 2, 3]
quotients = [a / b for a, b in zip(some_numbers, more_numbers) if b != 0]
print(quotients)
The list (or dictionary) comprehension drops all items not satisfying the condition.
Python has a built-in function next which allows iterating through iterable objects step by step. For this purpose
we first have to create an iterator object from our iterable object. This is done by the built-in function iter. Then
the iterator object is passed to next. The iterator object keeps track of what the next item is.
a = iter([3, 2, 5])
print(next(a))
print(next(a))
print(next(a))
3
2
5
Creation of the intermediate iterator object is done automatically if iterable objects are used in for loops and comprehensions.
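When an iterator is exhausted, next raises a StopIteration exception; for loops catch this exception internally to detect the end of the iteration:

```python
it = iter([3, 2])
print(next(it))  # 3
print(next(it))  # 2
try:
    next(it)     # no items left
except StopIteration:
    print('iterator exhausted')
```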
The in operator tests whether an iterable object contains a given item:
a = [1, 5, 6, 8]
print(1 in a)
print(10 in a)

True
False
The in operator also works with strings, and not in negates the test:
print('ell' in 'hello')
print(5 not in a)
print(10 not in a)

True
False
True
Note: Conditions a not in b and not a in b are equivalent. The first uses the not in operator, while the
second uses in and then applies the logical operator not to the result.
NINE
STRINGS
We already learned basic usage of strings in the Crash Course (page 43). Here we add the details. Correct string
handling is essential for loading data from files and also for writing to files.
• Basics (page 105)
• Special Characters (page 107)
• String Formatting (page 109)
Related exercises: Strings (page 262).
9.1 Basics
9.1.1 Substrings
Strings are sequence-type objects, that is, they can be regarded as lists of characters. Each character then is itself a
string object.
a = 'some string'
print(a[0])
print(a[1])
print(len(a))
s
o
11
b = a[2:8]
print(b)
me str
Remember that strings are immutable. Thus, a[3] = 'x' does not replace the fourth character of a by x, but
leads to an error message.
Sometimes it’s necessary to use line breaks when specifying strings in source code. For this purpose Python knows
triple quotes.
a = '''
first line
second line
last line
'''
print(a)
As we see, line breaks in source code become part of the string: the output starts and ends with an empty line stemming
from the line breaks right after the opening and right before the closing quotes. If this behavior is not desired, end each
line with \.
a = '''\
first line
second line
last line\
'''
print(a)
first line
second line
last line
This way we get one-to-one correspondence between source code and screen output.
Note: A trailing \ tells Python that the source code line continues on the next line, that is, that the next line break
is not to be considered as a line break. Usage is not restricted to strings. Long source code lines (longer than 80
characters) should be wrapped to multiple short lines.
9.2 Special Characters
To have line breaks and special characters in string literals we have to use escape sequences like \n, \", and so on. If
we want to prevent the Python interpreter from translating escape sequences into corresponding characters, we may
use raw strings. A raw string is taken as is by the interpreter. To mark a string literal as raw string prepend r.
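A short comparison of a regular and a raw string literal:

```python
print('column1\tcolumn2')   # \t is translated to a tab character
print(r'column1\tcolumn2')  # raw string: backslash and t kept as is
```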
Strings may contain characters not available on some or all keyboards, like umlauts ä, ö, ü on US keyboards. To use
such special characters in string literals Python provides escape sequences \x, \u, and \U.
The relation between characters in strings and their numerical representation in memory will be considered in detail
in the chapter on Text Files (page 114). For the moment we content ourselves with the observation that there are
several such mappings, the most prominent ones known as ASCII⁶⁹, the ISO 8859⁷⁰ family and several Unicode⁷¹
variants.
ASCII knows 128 different characters, the ISO 8859 family some hundred, and Unicode several million. Nowadays,
Unicode in its UTF-8⁷² variant is the standard mapping for numerical representations of characters. Even
Windows is adopting UTF-8 more and more after backing the wrong horse for two decades.
There are lots of searchable lists of Unicode characters on the web. Wikipedia’s List of Unicode characters⁷³ is a
good starting point.
63 https://docs.python.org/3/library/stdtypes.html#str.count
64 https://docs.python.org/3/library/stdtypes.html#str.find
65 https://docs.python.org/3/library/stdtypes.html#str.replace
66 https://docs.python.org/3/library/stdtypes.html#str.split
67 https://docs.python.org/3/library/stdtypes.html#str.upper
68 https://docs.python.org/3/library/stdtypes.html#str.isalnum
69 https://en.wikipedia.org/wiki/ASCII
70 https://en.wikipedia.org/wiki/ISO/IEC_8859
71 https://en.wikipedia.org/wiki/Unicode
72 https://en.wikipedia.org/wiki/UTF-8
73 https://en.wikipedia.org/wiki/List_of_Unicode_characters
To use Unicode characters in a string literal we either may type or copy them to the source code file or we may specify
the character numerically. If the numerical representation has two hexadecimal digits, use \x followed by the two
digits. In case of 4 digits use \u and for 8 digit characters use \U.
print('umlauts: \xe4, \xf6, \xfc')
print('Greek: \u03b1, \u03b2, \u03b3')
print('Chinese (?): \u4e2d, \u56fd')

umlauts: ä, ö, ü
Greek: α, β, γ
Chinese (?): 中, 国
Some Unicode codes do not represent concrete characters, but control reading direction (left to right or right to left)
and other properties like spacing.
Fig. 9.1: Collaborative editing can quickly become a textual rap battle fought with increasingly convoluted invocations
of U+202a to U+202e. Source: Randall Munroe, xkcd.com/1137⁷⁴
74 https://xkcd.com/1137
9.3 String Formatting
String objects have a format method. This method allows for converting numbers and other data to strings.
a = 4
b = 5
c = a + b
nice_string = 'The result of {} plus {} is {}.'.format(a, b, c)
print(nice_string)

The result of 4 plus 5 is 9.
Calling the format method of a string replaces all pairs {} by the arguments passed to the format method. Next to
integers also floats and strings can be passed to format. The original string is not modified because it’s immutable.
Instead, format returns a new string object.
The arguments of format can also be accessed by providing their index: the first argument has index 0, the second
has index 1, and so on.
a = 4
b = a + a
print('The result of {0} plus {0} is {1}.'.format(a, b))

The result of 4 plus 4 is 8.
For more complex output (or more readable code) keyword arguments can be passed to format.
Converting numbers to strings sometimes requires additional parameters: How many digits to use for floats? Shall
numbers in several lines be aligned horizontally? There is a whole ‘mini language’ for writing formatting options.
Here we provide only some examples. For details see Python documentation on format string syntax75 .
75 https://docs.python.org/3/library/string.html#formatstrings
If indices or names are used for referring to format’s arguments, they have to be placed on the left-hand side of the
colon: {name:3.2f}.
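A few common format specifications (the values are arbitrary):

```python
print('{:10.3f}'.format(3.14159))         # width 10, 3 decimal places
print('{:>8}'.format('abc'))              # right-aligned in width 8
print('{:<8}|'.format('abc'))             # left-aligned, | shows padding
print('{:05d}'.format(42))                # integer, zero-padded to 5 digits
print('{x:.1f}'.format(x=2.71828))        # keyword argument with precision
```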
Starting with version 3.6 of Python there is a more comfortable way to format strings, known as formatted string
literals or f-strings⁷⁶. It’s very similar to formatting via the format method, but requires less code and increases
readability. The two major differences are:
• string literals have to be prefixed by f,
• the curly braces may contain a Python expression (an object name, for instance).
a = 123
print(f'Here you see {a}.')
b = 4.56789
print(f'Formatting works as above: {b:.2f} is a rounded float.')

Here you see 123.
Formatting works as above: 4.57 is a rounded float.
76 https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals
TEN
ACCESSING DATA
There exist several sources of data: files, databases, and web services, for instance. Here we consider basic file access
and common file formats for storing large data sets. We also learn how to automatically download data from the web
and how to scrape data from websites.
• File IO (page 111)
• Text Files (page 114)
• ZIP Files (page 117)
• CSV Files (page 118)
• HTML Files (page 119)
• XML Files (page 121)
• Web Access (page 122)
Related exercises: File Access (page 264).
Related projects:
• Cafeteria (page 315)
• Weather (page 305)
– DWD Open Data Portal (page 305)
– Getting Forecasts (page 307)
10.1 File IO
Next to screen IO, input from and output to files is the most basic operation related to data processing. Almost all
data science projects start with reading data from one or more files. In this chapter we discuss basic file access. Later
on there will be several specialized modules and functions to make things more straightforward. But from time to
time, in case of uncommon file formats, one has to resort to the most basic operations.
10.1.1 Basics
Reading data from a file or writing data to a file requires three steps:
1. Open the file
Tell the operating system that file access is required. The operating system checks permissions and, if everything is okay, returns a file identifier (usually a number), which has to be used for all subsequent file operations.
2. Read or write data
Tell the operating system to move data between the file and some place in memory which can be accessed by the
Python interpreter.
3. Close the file
Tell the operating system that file access is finished, so that the file and related resources can be released.
Data Science and Artificial Intelligence for Undergraduates
f = open('testdir/testfile.txt', 'r')
file_content = f.read()
f.close()
print(file_content)
Some text
in some file
splitted over
multiple lines.
This code snippet opens a file for reading (argument 'r') and assigns the name f to the resulting file object. Then the whole content of the file is stored in the string object file_content. Finally, the file is closed and its content is printed to the screen.
If something goes wrong, for instance the file does not exist, the Python interpreter stops execution with an error
message. For the moment, we do not do any error checking when operating with files (this is very bad practice!).
Note: The read method and all other methods for reading and writing files can be used to process text data and binary data. Providing the 'r' argument to open tells Python to open the file as a text file. Reading data from the file then results in a string object. If 'rb' is used instead, then the file is handled as a binary file and reading results in a bytes object. Details will be discussed in the chapter on Text Files (page 114).
The default mode is 'r'. So specifying no mode opens the file for reading in text mode.
Important methods for reading and writing files are read, readline, readlines, write, writelines,
seek. See methods of file objects77 in the Python documentation.
For more details on access modes see the documentation of open78 .
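A small sketch of writelines and readlines (the file name and contents are made up; a temporary directory keeps the example self-contained):

```python
import os
import tempfile

# create a path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# write two lines to the file
f = open(path, 'w')
f.writelines(['first line\n', 'second line\n'])
f.close()

# read all lines back into a list of strings
f = open(path, 'r')
lines = f.readlines()
f.close()

print(lines)
```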
Paths to a file are operating system dependent. Thus, using paths in the open function makes our code operating
system dependent. This should be avoided and luckily there are techniques to avoid such OS dependence.
Linux/Unix/macOS
In Linux and other Unix-like systems (macOS, for instance), all files can be accessed via paths of the form '/directory/subdirs/file'. That is, a list of directory names separated by slashes and ending with the file name. If the path starts with a slash, it's an absolute path; otherwise it's a relative one.
Drives can be mounted as directories anywhere in the file system's hierarchy. Thus, there is no need for special drive-related path components.
Windows
Windows uses a different format: 'drive:\directory\subdirs\file'. Backslashes instead of slashes are used as delimiters, and absolute paths carry an additional drive letter followed by a colon. The purpose of the drive letter is to select one of several physical (or even logical) drives.
From the programmer’s point of view, additional effort is required to make code work in both worlds.
77 https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
78 https://docs.python.org/3/library/functions.html#open
The Python module os.path provides the function join. This function takes directory names and a file name as arguments and returns a string containing the corresponding path with appropriate (OS-dependent) delimiters. So the output of the following code snippet depends on the OS used for execution.

import os.path

# reconstructed: join directory and file name in an OS-independent way
test_path = os.path.join('testdir', 'testfile.txt')
print(test_path)

testdir/testfile.txt
import os

# reconstructed: show the OS-dependent path separator
print('path separator:', os.sep)

path separator: /
Often data sets are scattered over many files, for instance one file per customer, each file containing all the customers
transactions in an online shop. In such cases we need to get a list of all files in a specified directory. Such functionality
is provided by Python’s glob module.
import glob

file_list = glob.glob('testdir/*')
for file_name in file_list:
    print(file_name)
testdir/test.zip
testdir/testwrite-windows.txt
testdir/iso8859-1.txt
testdir/umlauts.txt
testdir/testfile.txt
testdir/testwrite.txt
testdir/utf-8.txt
The glob module’s glob function takes a path containing wildcards like * (arbitrary string) and ? (arbitrary char-
acter), for instance, and returns a list of all files matching the specified path.
10.2 Text Files

Text files are the most basic type of files. They contain string data. Historically there was a one-to-one mapping between byte values (0…255) and characters. Nowadays things are much more complex, because representing all the world's languages requires more than 256 different characters. When reading from and writing to text files, the mapping between characters and their numerical representation in memory or on storage devices is of utmost importance.
Text files not only contain so-called printable characters like letters and numbers, but also control characters like line breaks and tab stops. Related issues will be discussed in this chapter, too.
10.2.1 Encodings
Every kind of data has to be converted to a stream of bits. Else it cannot be processed by a computer. For strings we
have to distinguish between their representation on screen (which symbol) and their representation in memory (which
sequence of bits). Mapping between screen and memory representation is known as encoding. Decoding is mapping
in opposite direction.
Fig. 10.1: Fortunately, the charging one has been solved now that we’ve all standardized on mini-USB. Or is it micro-
USB? Shit. Source: Randall Munroe, xkcd.com/92779
ASCII
Historically, each character of a string has been encoded as exactly one byte. A byte can hold values from 0 to
255. Thus, only 256 different characters are available, including so called control characters like tabs and new line
characters.
The mapping between byte values and characters, the so-called character encoding, has to be standardized to allow exchanging text files. For a long time, the most widespread standard has been ASCII (American Standard Code for Information Interchange). But since ASCII does not contain special characters like umlauts from other languages, several other encodings were developed. The ISO 885980 family is a very prominent set of ASCII derivatives.
79 https://xkcd.com/927
80 https://en.wikipedia.org/wiki/ISO/IEC_8859
The first 128 characters of almost all encodings coincide with ASCII, but the remaining 128 contain different symbols.
Thus, to read text files one has to know the encoding used for saving the file. Typically, the encoding is not (!) saved
in the file, but has to be guessed or communicated along with the file. Have a look at the list of encodings81 Python
can process.
Unicode
Nowadays, Unicode is the standard encoding. More precisely, Unicode defines a group of encodings. We do not go
into the details here. For our purposes it suffices to know that Unicode contains several hundred thousand symbols
and the most important encoding of Unicode is called UTF-8. The eight means that most characters require only 8
bits. The symbols associated with the byte values 0 to 127 coincide with ASCII. A byte value above 127 indicates a
multi-byte symbol comprising two, three, or four bytes.
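This can be observed directly in Python; the string 'Höhe' here is just an illustrative choice:

```python
s = 'Höhe'
encoded = s.encode()  # UTF-8 is the default encoding

print(len(s))        # number of characters
print(len(encoded))  # number of bytes; 'ö' occupies two bytes in UTF-8
print(encoded)
```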
Linux/Unix/macOS
Non-Windows systems (Linux, Unix, macOS) have had native UTF-8 support for decades. It's the standard encoding for websites and other internet-related applications.
Windows
Windows, even Windows 10, uses a different Unicode encoding under the hood and supports UTF-8 only at the surface. Sometimes, if one has to dig deeper into the system, unexpected things may happen. Older Windows versions did not have UTF-8 support at all. Always check the encoding if you work with text data generated on a Windows system!
Encodings in Python
Python uses UTF-8 and strictly distinguishes between strings and their encoded representation. The string is what we see on screen, whereas the encoded form is what is written to memory and storage devices.
String objects provide the encode member function. This function returns a sequence of bytes. This sequence is of
type bytes. A bytes object is immutable. In essence, it’s a tuple of integers between 0 and 255.
The other way round bytes objects provide a member function decode to transform them to strings.
As we see, bytes objects can be specified like strings, but prefixed by b. The only difference is that all bytes
holding values above 127 or non-printable characters (line breaks, for instance) are replaced by their integer values
in hexadecimal notation with the prefix \x, which is the escape sequence for specifying characters in hexadecimal
notation. If we want to use octal notation, the escape sequence is \000 where 000 is to be replaced by a three digit
octal number.
b = 'Höhe'.encode()  # example value; the original snippet defining b is not shown
print(b)

b'H\xc3\xb6he'

c = b.decode()
print(c)

Höhe
Note: The encode and decode methods accept an optional encoding parameter, which defaults to 'utf-8'.
There is also a mutable version of bytes objects: bytearray objects. They provide a decode function, too.
81 https://docs.python.org/3/library/codecs.html#standard-encodings
Reading from a file opened in text mode is equivalent to reading after opening in binary mode followed by a call to decode. Similarly for writing. The open function accepts an optional encoding parameter for text mode. Its default is platform dependent (the locale's preferred encoding, which on most Linux systems is UTF-8).
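A sketch of the encoding parameter in action (file name and content are made up; a temporary directory keeps it self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'latin1.txt')

# write text containing a non-ASCII character with ISO 8859-1 encoding
f = open(path, 'w', encoding='iso-8859-1')
f.write('Müller')
f.close()

# in ISO 8859-1 every character is one byte, so the file holds 6 bytes
print(os.path.getsize(path))

# reading requires the matching encoding
f = open(path, 'r', encoding='iso-8859-1')
content = f.read()
f.close()
print(content)
```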
Encoding line breaks in text files is done differently on different operating systems. The ASCII and Unicode standards
define two symbols indicating a line break. One is symbol 10, known as line feed (LF for short). The other is symbol
13, known as carriage return (CR for short).
Historically, when typewriters were the standard text processing tools, starting a new line required two actions: move
to next line without moving the carriage, then move the carriage to its rightmost position. Thus, there are two different
symbols for these two actions.
Linux/Unix/macOS
Linux and other Unix like system (macOS, for instance) use single byte line breaks encoded by LF. Old versions of
macOS used CR, but then developers switched to LF.
Windows
Windows adheres to the two-step legacy from the pre-computer era. That is, on Windows line breaks in text data are encoded by the two bytes CR and LF.
Python can handle all three versions of line break codes (LF, CR, CR LF) and tries to hide the differences from the programmer. But be aware that writing text files may produce different results on Windows and Linux/Unix/macOS machines.
Wrong Encoding

If we open an ISO 8859-1 encoded text file without specifying an encoding (that is, UTF-8 is used), the interpreter either fails to interpret some bytes or shows wrong symbols.

import os.path

f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
text = f.read()
f.close()

print(text)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 f = open(os.path.join('testdir', 'iso8859-1.txt'), 'r')
----> 2 text = f.read()
3 f.close()
5 print(text)
If we open a UTF-8 encoded file with ISO 8859-1 decoding, we see garbled symbols.

# reconstructed snippet; the file name is taken from the directory listing above
f = open(os.path.join('testdir', 'utf-8.txt'), 'r', encoding='iso-8859-1')
text = f.read()
f.close()

print(text)
On Linux and Co. the file will have 18 bytes. On Windows it will have 28 bytes due to Windows’ 2-byte line breaks.
Opening the file in binary mode shows the line break encoding:

# reconstructed snippet; the file name is assumed from the directory listing above
f = open(os.path.join('testdir', 'testwrite.txt'), 'rb')
text = f.read()
f.close()

print(tuple(text))
(116, 101, 115, 116, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 116, 101, 115, 116)
Using print(text) directly shows line breaks as \n, which is nice almost always, but not here. So we convert
the bytes object to a tuple of integers before printing.
If the file has been written on a Windows machine, it looks like this:
(116, 101, 115, 116, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13, 10, 13,
↪ 10, 13, 10, 13, 10, 116, 101, 115, 116)
10.3 ZIP Files

Large data sets usually ship as compressed files, mostly ZIP files. Extracting such files requires lots of disk space. Compressed text files, for instance, are much smaller than the original files (a factor of 5 to 10).
In Python we can use the zipfile module. This module allows reading single files from a ZIP archive without extracting the whole archive. We have to create an object of type zipfile.ZipFile. Such objects provide an open method. The return value of open is a file-like object, that is, it can be processed like usual files. Files from ZIP archives are always opened in binary mode by the ZipFile object's open method.
The namelist method returns a list of file names in the ZIP archive.
import os.path
import zipfile

# reconstructed snippet: open the archive, list the contained files, read one of them
zf = zipfile.ZipFile(os.path.join('testdir', 'test.zip'))
print(zf.namelist())

f = zf.open('file.txt')
print('file contents:')
print(f.read().decode())
f.close()
zf.close()

['another_file.txt', 'file.txt']
file contents:
This is a file for testing zipfile module.
10.4 CSV Files

The simplest format for storing spreadsheet data is comma-separated values (CSV) in a text file. Each line of a CSV file contains one row of the spreadsheet. The columns are separated by commas or sometimes by another symbol. CSV files may contain column headers in their first line(s).
A typical CSV file looks like this:
first_name,last_name,town
John,Miller,Atown
Ann,Abor,Betown
Bob,Builder,Cetown
Nina,Morning,Detown
CSV files are not standardized. Thus, there might be cumbersome deviations from what one expects of a simple CSV file. The CSV format is used to move data between different programs which cannot read each other's native file formats.
In Python we can use the module csv for reading data from CSV files. It provides the class csv.reader. When
creating a csv.reader object we have to pass a file object of the CSV file as parameter. The csv.reader object
then is an iterator object. It yields one line of the CSV file per iteration. More precisely, it yields a list of strings.
Each string contains the data from the corresponding column.
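A minimal sketch of csv.reader; here the data comes from an in-memory string wrapped in io.StringIO instead of a real file object:

```python
import csv
import io

# same structure as the example CSV file above
data = 'first_name,last_name,town\nJohn,Miller,Atown\nAnn,Abor,Betown\n'

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)  # each row is a list of strings
```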
See documentation of csv module82 for details.
82 https://docs.python.org/3/library/csv.html
10.5 HTML Files

HTML (Hypertext Markup Language) files are text files containing additional information for rendering text, like font type, font size, and foreground and background colors. Also images, tables, and other objects may be described or referenced by an HTML document. Typically, HTML files are interpreted and rendered by web browsers. Almost all websites consist of HTML files.
In data science, knowing some basic HTML is important for web scraping, that is, for automatically extracting information from websites.
<html>
<head>
<title>Title of webpage</title>
</head>
<body>
<h1>Some heading</h1>
<p>Text and text and more text in a paragraph.
Here comes a <a href="http://some.where">link to somewhere</a>.</p>
</body>
</html>
The file starts with <html> and ends with </html>. Then there is a head and a body. The head contains auxiliary
information like the webpage’s title, which is often shown in the browser window’s title bar. The body contains the
contents of the page.
There are many different HTML tags to influence rendering of the contents.
Headings from large to small: h1, h2, h3, h4, h5.
Paragraph: p.
Link: a with attribute href.
Table: table, tr (row inside table), td (cell inside row), and some more.
Image: img with attribute src (the URL of the image).
Invisible elements for layout control: span (inline element), div (box).
All tags have the attributes style (for specifying font size, colors and so on), id (a unique identifier for advanced
style control and scripting), class (an identifier shared by several elements for advanced layout control).
Have a look at the HTML documentation83 for details.
Modern browsers have tools to help understand an HTML file's structure. In Firefox or Chromium right-click some
element of the webpage and click ‘Inspect’ in the pop-up menu. Then navigate through the HTML source. To see the
whole HTML source code right-click and choose ‘View Page Source’.
83 https://html.spec.whatwg.org/
There are several modules available for parsing HTML files in Python. Here, parsing means to convert the textual representation into more structured Python objects. One such module is Beautiful Soup84 , which is not part of Python's standard library and has to be installed separately.
The package name for installation is beautifulsoup4; for importing, bs4 is the correct name.
import bs4
We have to create a BeautifulSoup object, whose constructor takes a string or an opened file object as argument.
The BeautifulSoup object then provides methods to find HTML tags by specifying tag name, id attribute, class
attribute or one of several other properties. We do not have to write code for parsing HTML files. Instead we can
search the file with BeautifulSoup’s methods.
html = '''\
<html>
<head>
<title>Title of webpage</title>
</head>
<body>
<h1>Some heading</h1>
<p>Text and text and more text in a paragraph.
Here comes a <a href="http://some.where">link to somewhere</a>.</p>
</body>
</html>
'''
soup = bs4.BeautifulSoup(html)
The find_all method returns a list of objects representing subsets of the HTML file matching the arguments
passed to find_all. In the following code snippet we search for a tags, that is, for links. But we could also search
for certain attribute values and other criteria. There is also a find method which returns the first occurrence only.
The objects returned by find_all and find themselves provide corresponding methods to refine search.
links = soup.find_all('a')
print('#links:', len(links))
print('last link:', links[-1])
#links: 1
last link: <a href="http://some.where">link to somewhere</a>
10.6 XML Files

XML (Extensible Markup Language) files look like HTML Files (page 119), but with custom tag names. Each XML file may use its own set of tags to describe data. In principle, HTML is a special case of XML.
Standard-conforming XML files have some format specifications in their first lines. Content without such specifications could look like this:
<person>
<first_name>John</first_name>
<last_name>Miller</last_name>
<town>Atown</town>
</person>
<person>
<first_name>Ann</first_name>
<last_name>Abor</last_name>
<town>Betown</town>
</person>
<person>
<first_name>Bob</first_name>
<last_name>Builder</last_name>
<town>Cetown</town>
</person>
<person>
<first_name>Nina</first_name>
<last_name>Morning</last_name>
<town>Detown</town>
</person>
XML files can be parsed like HTML files with Beautiful Soup.
import bs4
xml = '''\
<person>
<first_name>John</first_name>
<last_name>Miller</last_name>
<town>Atown</town>
</person>
<person>
<first_name>Ann</first_name>
<last_name>Abor</last_name>
<town>Betown</town>
</person>
<person>
<first_name>Bob</first_name>
<last_name>Builder</last_name>
<town>Cetown</town>
</person>
<person>
<first_name>Nina</first_name>
<last_name>Morning</last_name>
<town>Detown</town>
</person>
'''
soup = bs4.BeautifulSoup(xml)

towns = soup.find_all('town')
print('#towns:', len(towns))
print('last town:', towns[-1])

#towns: 4
last town: <town>Detown</town>
10.7 Web Access

Today's primary source of data is the World Wide Web. In the simplest case we may download a data set as one single file. Many data providers instead offer an API (application programming interface) for accessing and downloading data. The worst case is if we have to scrape data from a website's HTML and other files.
Websites and other web services are hosted on servers somewhere in the world. If we type a URL (web address) into a browser's address bar, the browser connects to the corresponding server and asks it to send the desired file to the user's computer. This process is referred to as requesting a file or sending a request. Our computer is the client, asking the server for some service (sending a file). It's important to understand that we cannot simply collect a file from a remote server. We may only send a request to the server to send a file to us. The server may fulfill our request, send an error message, or not answer the request at all.
The technology behind this is much more involved than one might think: How to find the correct server? Which language
to speak with the server? What to do if the server does not answer the request? And so on. If you are interested in
some background details, use DNS86 and HTTP87 as entry points.
The Python interpreter may take the role of the browser and request files from servers.
To download a webpage or some other file from the web we may use the requests module. Note that requests is not part of the Python standard library; it is a third-party package, but the de facto standard for HTTP requests in Python.
The module provides a function get which takes a URL and returns a Response object. The Response object contains information about the server's answer to our request. If the request has been successful, the content member variable contains the requested file as a bytes object.
import requests
response = requests.get('https://www.fh-zwickau.de/~jef19jdw/index.html')
print(response.content.decode())
86 https://en.wikipedia.org/wiki/Domain_Name_System
87 https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
Many webpages are to some extent dynamic. Their content can be influenced by passing parameters to them. Different
techniques exist for this purpose. Most common are so-called ‘GET’ and ‘POST’. We only consider the first method
here.
Passing arguments via ‘GET’ is very simple. We just add them to the URL. If the webpage processes arguments with
names arg1, arg2, arg3 and if we want to pass corresponding values value1, value2, value3, we may
request the URL
http://some.where/some_page.html?arg1=value1&arg2=value2&arg3=value3
The requests.get function accepts the keyword argument params, which increases readability. Instead of composing a long URL string we may write:
url = 'http://some.where/some_page.html'
params = {'arg1': 'value1',
'arg2': 'value2',
'arg3': 'value3'}
response = requests.get(url, params=params)
Most web services for data retrieval do not return HTML documents, but more machine-readable formats like CSV88, JSON89, or YAML90. There are Python modules for parsing all common formats.
Sometimes the data we want to analyze is scattered over a website and no direct connection to the underlying database is available. Thus, we have to find ways to extract data from websites automatically. This process is referred to as web scraping.
Legal considerations
There is no law which directly prohibits web scraping. But a website or part of it may be protected by copyright law.
Almost all large websites have terms of use, which have to be respected by the user. Some websites explicitly prohibit
automated data extraction. Some only prohibit commercial use of the provided data. Before starting a scraping project
read the terms of use!
When in doubt ask the website provider for written permission to scrape data from the site or ask a lawyer!
Another issue is the web traffic caused by scrapers. A scraping project might require several thousand requests to a server within a very short time. This may hurt the provider's infrastructure. A common attack for taking down a website is to send thousands of requests fast enough to prevent the server from answering requests from other users (DoS attack, denial-of-service attack). We don't want to be attackers. Thus, whenever you start a scraping project, tell your script to wait a few seconds between consecutive requests to a server!
88 https://en.wikipedia.org/wiki/Comma-separated_values
89 https://en.wikipedia.org/wiki/JSON
90 https://en.wikipedia.org/wiki/YAML
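The pacing advice above can be sketched as follows; fetch is a made-up stand-in for the real download call (requests.get, for instance), and the URLs are fictitious:

```python
import time

def fetch(url):
    # placeholder for a real request such as requests.get(url)
    return 'content of ' + url

urls = ['http://some.where/page1', 'http://some.where/page2']

pages = []
for url in urls:
    pages.append(fetch(url))
    time.sleep(0.1)  # in a real project wait a few seconds between requests

print(len(pages))
```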
Downloading a webpage via requests.get only yields the HTML document. Images and other media are usually not contained in HTML files. To download all images of a webpage we would have to find all img tags in the HTML file, then extract the URLs from the corresponding src attributes, and then download each URL separately.
Scraping data from websites often is tedious work and each scraping project requires different techniques for data
extraction. Knowing some little helpers may save the day.
Regular Expressions
There is a mini language for describing text search patterns. Such patterns are called regular expressions. They can be used, for instance, in conjunction with Beautiful Soup. We do not go into the details here.
Just an example (the original snippet is not shown here; this is a minimal reconstruction):

import re

# find all capitalized words in a string
result = re.findall(r'[A-Z][a-z]+', 'Python and Perl are programming languages.')
print(result)

['Python', 'Perl']
Most data contains time stamps. Python ships with the modules datetime and time for handling dates and times. The former provides tools for carrying out calculations with dates and times; the latter provides different time-related functionality.
datetime provides objects expressing a point in time (date, time, datetime) and objects expressing a duration
(timedelta).
import datetime

# reconstructed example: date arithmetic with a one-week timedelta
new_date = datetime.date(2020, 6, 30) + datetime.timedelta(days=7)
print(f'It\'s {new_date.day:02}.{new_date.month:02}.{new_date.year}.')

It's 07.07.2020.
import time
print('Have a break...')
time.sleep(5) # seconds
print('...now I\'m back.')
Have a break...
...now I'm back.
93 https://docs.python.org/3/library/time.html
ELEVEN
FUNCTIONS
We already met functions in the Crash Course (page 43). Here we repeat the basics and add lots of details important
for successful Python programming.
• Basics (page 127)
• Passing Arguments (page 128)
• Anonymous Functions (Lambdas) (page 132)
• Function and Method Objects (page 132)
• Recursion (page 133)
Related exercises: Functions (page 265).
11.1 Basics
A function has a name, which is used to call the function. A function can take arguments to control its behavior, and
a function may return a value, which then can be used by the calling code.
If a function shall return a value, then its code block has to contain the following line at least once (usually as the last line):
return some_value
The return keyword immediately stops execution of the function’s code and hands control back to the calling code,
which then can use the content of some_value.
If there is no return keyword in a function, then the function ends after executing its last line and returns None.
In this sense, Python functions always return a value.
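A short sketch of this behavior:

```python
def no_return():
    # no return statement anywhere in the body
    pass

result = no_return()
print(result)  # None
```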
After a call like a = some_function(), the name a refers to the function's return value. If the function does not return a value or the return value is not needed by the calling code, then the assignment a = ... can be omitted.
In Python, by convention, every function definition contains a triple-quoted documentation string. This string is ignored by the Python interpreter, but read by tools for automatic generation of source code documentation.
def my_function():
    '''Does nothing.'''
    pass
For some formatting conventions see the Python documentation94 . More details: PEP 25795 . There are many different conventions for docstring formatting; PEP 257 is only one of them.
In contrast to several other programming languages Python provides very flexible and readable syntax constructs for
passing data to functions. Here we’ll also discuss what happens in memory when passing data to functions.
Positional arguments have to be passed in exactly the same order as they appear in the function’s definition. There
can be as many positional arguments as needed. But a function may come without any positional arguments at all,
too.
Positional arguments may have a default value, which is used if the argument is missing in a function call. Syntax:

def some_function(arg1, arg2='some default value'):
    ...

If there are mandatory arguments (that is, arguments without default value) and optional arguments, then the latter have to follow the former.
94 https://docs.python.org/3/tutorial/controlflow.html#tut-docstrings
95 https://www.python.org/dev/peps/pep-0257/
If there are several optional arguments and only some shall be passed to the function, they can be provided as keyword
arguments. The order of keyword arguments does not matter when calling a function.
my_function(arg3=12345, arg2=54321)
Keyword arguments are very common in Python, since they increase readability when calling functions with many
arguments.
If we need a function which can take an arbitrary number of arguments, we may use the following syntax:

def some_function(arg1, arg2, *other_args):
    ...

Then other_args contains a tuple of all arguments passed to the function, but without arg1 and arg2.
If we need a function which can take an arbitrary number of keyword arguments, we may use the following syntax:

def some_function(kwarg1=0, kwarg2=0, **other_kwargs):
    ...

Then other_kwargs contains a dictionary of all keyword arguments passed to the function, but without kwarg1 and kwarg2.
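Both constructs can be sketched together (the function and argument names are made up):

```python
def show_args(arg1, *other_args, **other_kwargs):
    # collect extra positional arguments in a tuple
    # and extra keyword arguments in a dictionary
    return arg1, other_args, other_kwargs

print(show_args(1, 2, 3, x=4, y=5))  # (1, (2, 3), {'x': 4, 'y': 5})
```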
If we have a list or tuple and all items shall be passed as single arguments to a function, then we should use argument unpacking:

my_list = [4, 3, 1]
some_function(*my_list)

This call is equivalent to:

some_function(my_list[0], my_list[1], my_list[2])
Python never copies objects passed to a function. Instead, the argument names in a function definition are tied to the objects whose names are given in the function call.

def some_function(x, y):
    # reconstructed definition: print the ids of the arguments inside the function
    print(id(x), id(y))

a = 5
b = 'some string'
print(id(a), id(b))
some_function(a, b)
139792935403888 139792871198064
139792935403888 139792871198064
Here we have to take care: if we pass mutable objects to a function, then the function may modify these objects!
def clear_list(l):
for k in range(0, len(l)):
l[k] = 0
my_list = [2, 5, 3]
clear_list(my_list)
print(my_list)
[0, 0, 0]
Always look up a function's documentation if you have to pass mutable objects to a function. If the function modifies an object, this fact should be stated in the documentation. For instance, several functions of the OpenCV96 library, which we'll use later on, modify their arguments without properly documenting it.
A similar issue arises if we use mutable objects as default values for optional arguments. The name of an optional argument is tied to its default object only once, at the time of the function definition. If this object gets modified, then the default value changes for subsequent function calls.
def append42(l=[]):
    # reconstructed definition: the default list is created once and shared between calls
    l.append(42)
    print(l)

append42()
append42([1, 2, 3])
append42()
append42()

[42]
[1, 2, 3, 42]
[42, 42]
[42, 42, 42]
96 https://opencv.org
The standard way to avoid this pitfall is to create the default list inside the function (definition reconstructed):

def append42(l=None):
    if l is None:
        l = []    # fresh list for every call
    l.append(42)
    print(l)

append42()
append42([1, 2, 3])
append42()
append42()
[42]
[1, 2, 3, 42]
[42]
[42]
In a call to the function the first group of arguments has to be passed without keyword, the second group may be
passed with or without keyword, and the third group has to be passed by keyword.
The reason for the existence of this technique is quite involved and, presumably, we won't need this feature. But we should know it to understand code written by others.
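The grouping described above comes from the / and * markers in a function definition (positional-only parameters require Python 3.8 or newer; the function here is made up):

```python
def f(a, b, /, c, *, d):
    # a, b: positional-only; c: positional or keyword; d: keyword-only
    return (a, b, c, d)

print(f(1, 2, 3, d=4))     # (1, 2, 3, 4)
print(f(1, 2, c=3, d=4))   # (1, 2, 3, 4)
```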
The flexibility of argument passing makes it hard to clearly document which variants a library function accepts. Python's documentation uses a special syntax to state type and number of arguments as well as default values of a function.
Example: The glob module's glob function (see File IO (page 111)) is shown in Python's documentation97 as follows:

glob.glob(pathname, *, root_dir=None, dir_fd=None, recursive=False)

We see:
• pathname is the only positional argument.
• There are three arguments which have to be passed by keyword.
97 https://docs.python.org/3/library/glob.html#glob.glob
Square brackets in a documented signature mark optional arguments. The built-in input function98 , for instance, is documented as:

input([prompt])
Sometimes one has to call functions which take a function as argument. Passing a function as argument is very simple,
just give the function’s name as argument:
some_function(my_function)
Often, functions passed to other functions are needed only once in the code, and almost always they have a very simple structure. Providing a full function definition and wasting a name on such throwaway functions should thus be avoided. The tool for avoiding this overhead is the anonymous function, in Python known as a lambda. Here is an example:
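The example itself is not reproduced in this copy; a typical use is passing a throwaway key function to a sort (the list of pairs is made up):

```python
pairs = [(1, 'b'), (2, 'a'), (3, 'c')]

# sort by the second component of each pair
pairs.sort(key=lambda p: p[1])
print(pairs)  # [(2, 'a'), (1, 'b'), (3, 'c')]
```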
The lambda keyword creates a function in the same way as def, but without assigning a name to it. Keyword
arguments are allowed, too. In principle it is possible to define named functions with lambda:
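A sketch of a named lambda (the name square is made up); note that PEP 8 discourages this style in favor of def:

```python
square = lambda x: x * x

# equivalent definition with def, which is the preferred style
def square_def(x):
    return x * x

print(square(5), square_def(5))  # 25 25
```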
def my_function():
print('Oh, you called me!')
print(type(my_function))
print(id(my_function))
print(dir(my_function))
98 https://docs.python.org/3/library/functions.html#input
99 https://docs.python.org/3/library/functions.html#print
<class 'function'>
140438181160432
['__annotations__', '__builtins__', '__call__', '__class__', '__closure__', '__
↪code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__
↪str__', '__subclasshook__']
print(my_function.__name__)
my_function
class my_class:
def some_method(self, some_string):
print('You called me with {}'.format(some_string))
my_object = my_class()
print(type(my_object.some_method))
print(type(my_class.some_method))
<class 'method'>
<class 'function'>
If the method my_object.some_method is called, then the Python interpreter inserts the owning object as first
argument and calls the corresponding function my_class.some_method. In other words, the following two lines
are equivalent:
my_object.some_method('Hello')
my_class.some_method(my_object, 'Hello')
11.5 Recursion
A useful programming technique is recursion: a function calls itself until some stopping criterion is satisfied.
To illustrate this approach, consider a list. Each item either is an integer or again a list. If it is a list, then each item
of this list is an integer or another list, and so on. The task is to calculate the sum of all integers. This can’t be solved
by nested for loops, because we do not know the depth of the list nesting in advance.
def sum_list(l):
    ''' Sum up list items recursively. '''
    current_sum = 0
    for k in l:
        if type(k) == int:
            current_sum += k
        else:
            current_sum += sum_list(k)
    return current_sum

a = [1, 2, [3, 4, [5, 6], 7], 8]  # example nested list; any nesting works

print(sum_list(a))

36
TWELVE
We already met modules and packages in the Crash Course (page 43). Here we add the details and learn how to write
new modules and packages.
To get access to the functionality of a module it has to be imported into the source code file or into the interactive
interpreter session:
import module_name
This creates a Python object with name module_name (everything is an object in Python!) whose methods are
the functions defined in the module file. All names (functions, types, and so on) defined in the module can then be
accessed this way:
module_name.some_function()
Use the built-in function dir to get a list of all names defined in an imported module.
import datetime
print(dir(datetime))
A module’s name can appear very often in source code. The as keyword allows us to abbreviate module names:

import module_name as mod

mod.some_function()
It is even possible to avoid typing module names at all. If only a few functions of a module are needed, then they can
be imported directly:

from module_name import some_function, some_other_function

some_function()
some_other_function()
Data Science and Artificial Intelligence for Undergraduates
Directly imported names can be renamed with as, too:

from module_name import some_function as func

func()

With an asterisk all of a module’s names are imported at once:

from module_name import *

some_function()
But be careful; modules may contain hundreds of functions and importing all these functions may slow down your
code.
Note: The import statement first makes the Python interpreter look for a built-in module with the requested name
(that is, a module integrated directly into the interpreter). If there is none, it looks for a file module_name.py in
the directory containing the source code file. Then several other directories are taken into account. If the interpreter
does not find the requested module, an error message is shown.
Packages are collections of modules and can be imported in the same way as modules. Packages may contain sub-
packages. An import statement could look like this:
import package_name
package_name.subpackage_name.module_name.some_function()
The import statement creates a tree of objects and subobjects which reflects the structure of the package. To import
only one subpackage or one module from a package, use from:

from package_name import subpackage_name

subpackage_name.module_name.some_function()

or

from package_name.subpackage_name import module_name

module_name.some_function()
Python ships with a large number of modules and packages, known as the Python standard library. Have a look at the
complete list100 of the standard library’s contents and also at the Brief Tour of the Standard Library101 as well as the Brief
Tour of the Standard Library - Part II102 .
We already introduced some of the standard library’s modules and packages (datetime and os.path for instance)
and we will continue to introduce new functionality when needed for our purposes.
100 https://docs.python.org/3/library/index.html
101 https://docs.python.org/3/tutorial/stdlib.html
102 https://docs.python.org/3/tutorial/stdlib2.html
Writing our own module is very simple: put function definitions in a file my_module.py and import it with
import my_module. Writing modules makes code reusable and increases readability.
When importing a module the Python interpreter executes the module file. If you want to use a Python source code
file as a script as well as a module, you might check the value of the pre-defined variable __name__. If __name__
== '__main__', then the code is being executed as a script. If __name__ == 'module_name', then the
code is being run due to an import statement.
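The idiom could look as follows (the file name greet.py and its content are made up):

```python
# greet.py -- usable as a module (import greet) and as a script (python greet.py)
def greet(name):
    return 'Hello, {}!'.format(name)

if __name__ == '__main__':
    # this block runs only when the file is executed as a script,
    # not when it is imported
    print(greet('world'))
```

Importing greet from another file gives access to greet.greet without triggering the print.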
It’s also possible to use compiled modules103 .
Of course you can write your own packages. A package then is a directory package_name which contains the
Python files for all modules in the package and in addition a file __init__.py. This file might be empty, but is
required to mark a directory as Python package. Subpackages are subdirectories with __init__.py file.
For details see Python’s documentation104 .
Python does not support hidden functions or variables in modules. Hiding members of a class from the user of
the class is not possible either. But sometimes this would be quite useful: variables needed only for internal calculations
or little helper functions and methods shouldn’t be visible from outside, because if they were hidden we could
rename or remove them at will without breaking source code which uses the module or
class.
In Python there is a convention to mark private members: a leading underscore as in _hidden_by_convention.
Variables, functions and methods preceded by an underscore should not be accessed
or called from outside the module or class definition. But this is only a convention; nothing prevents you from violating
it.
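A small sketch of the convention (all names here are made up):

```python
# hypothetical module content illustrating the leading-underscore convention
_CM_PER_INCH = 2.54  # internal constant, not part of the public interface

def _scale(x):
    # helper function intended for internal use only
    return x * _CM_PER_INCH

def inches_to_cm(x):
    # public function of the (hypothetical) module
    return _scale(x)

print(inches_to_cm(10))  # approximately 25.4
```

Users of the module should call inches_to_cm only; _scale and _CM_PER_INCH may change or disappear in future versions.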
103 https://docs.python.org/3/tutorial/modules.html#compiled-python-files
104 https://docs.python.org/3/tutorial/modules.html#packages
THIRTEEN
Up to now we did not care about error handling. If something went wrong, the Python interpreter stopped execution
and printed some message. But Python provides techniques for more controlled error handling.
The Python interpreter parses the whole source code file before execution. In this phase the interpreter may encounter
syntax errors. That is, the interpreter does not understand what we want it to do; the code does not look the way
Python code should. Syntax errors are easily fixed by the programmer.
The more serious types of errors are runtime errors (or semantic errors) which occur during program execution.
Handling runtime errors is sometimes rather difficult.
The traditional way of handling runtime errors is to avoid them altogether. All user input and all other sources
of possible trouble get checked in advance by incorporating suitable if clauses into the code. This approach decreases
the readability of the code, because the important lines are hidden between lots of error checking routines.
The more Pythonic way of handling runtime errors is exceptions. Every time the interpreter encounters some problem,
like a division by zero, it raises an exception. The programmer may catch the exception and handle it appropriately,
or may leave exception handling to the Python interpreter. In the latter case, the interpreter usually
stops execution and prints a detailed error message.
try:
# code which may cause troubles
except ExceptionName:
# code for handling a certain exception caused by code in try block
except AnotherExceptionName:
# code for handling a certain exception caused by code in try block
else:
# code to execute after successfully finishing try block
The try block contains the code to be protected, that is, the code which might raise an exception. Then there is at
least one except block. The code in the except block is only executed, if the specified exception has been raised.
In this case, execution of the try block is stopped immediately and execution continues in the except block.
There can be several except blocks for handling different types of exceptions. Instead of an exception name also a
tuple of names can be given to handle several different exceptions in one block.
The else block is executed after successfully finishing the try block, that is, if no exception occurred. Here is
the right place for code which shall only be executed if no exception occurred, but for which no explicit exception
handling shall be implemented.
Here is an example:
a = 0  # some number from somewhere (e.g., user input)

try:
    b = 1 / a
except ZeroDivisionError:
    print('Division by zero. Setting result to 1000.')
    b = 1000  # set b to some (reasonable) value
else:
    print('Everything okay.')

print('Result is {}.'.format(b))
Without using exception handling the interpreter would stop execution at the division line. By catching the exception
we can avoid this automatic behavior and handle the problem in a way which does not prevent further program
execution.
Note that exception names are not strings, but names of object types (classes). Thus, don’t use quotation marks.
print(type(ZeroDivisionError))
print(dir(ZeroDivisionError))
<class 'type'>
['__cause__', '__class__', '__context__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', ...]
The Python documentation contains a list of built-in exceptions105 . There is a kind of tree structure in the set of all
exceptions and we may define new exceptions if we need them to express errors specific to our program. These topics
will be discussed in detail when delving deeper into object oriented programming.
13.1.4 Clean-Up
Sometimes it’s necessary to perform clean-up operations, like closing a file, no matter whether an exception occurred
or not. For this purpose Python provides the finally keyword:
try:
# code which may cause troubles
except ExceptionName:
# code for handling a certain exception caused by code in try block
else:
# code to execute after successfully finishing try block
finally:
# code for clean-up operations
105 https://docs.python.org/3/library/exceptions.html#concrete-exceptions
The finally block is executed after the try block (if there is no else block) or after the else block if no
exception occurred. If an exception occurred, then the finally block is executed after the corresponding except
clause. If the try or except clauses contain break, continue or return, then the finally block is executed
before break, continue or return takes effect. If a finally block executed before a return itself contains a
return, then finally’s return value is used and the original return value is ignored.
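The last rule can be checked with a small sketch:

```python
# a return in finally overrides the return in try
def which_return():
    try:
        return 'from try'
    finally:
        return 'from finally'

print(which_return())  # from finally
```

This behavior is one reason why returning from a finally block is usually considered bad style.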
Note: As long as a file is opened by our program the operating system may block file access for other programs. Thus,
we should close a file as soon as possible. Forgetting to close a file is not too bad, because the OS will close it for us
after program execution has stopped. But for long-running programs with only short file access at start-up, a non-closed
file may block access by other programs for hours or days. Thus, especially in case of exception handling,
make sure that in each situation (with or without exception) files get closed properly by the program.
Some object types, file objects for instance, include predefined clean-up actions. That is, for certain operations (e.g.,
opening a file) they define what should be done in a corresponding finally block (e.g., closing the file), if the
operations would be placed in a try block.
To use this feature Python has the with keyword:
with open('some_file') as f:
# do something with file object f
If the open function is successful, then the indented code block is executed. If open fails, an exception is raised.
In both cases with ensures that proper clean-up (closing the file) takes place.
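A small sketch (the file name is made up; tempfile provides a writable location):

```python
import os
import tempfile

# hypothetical file in the system's temp directory
path = os.path.join(tempfile.gettempdir(), 'with_demo.txt')
with open(path, 'w') as f:
    f.write('hello')
print(f.closed)  # True: the with statement closed the file for us
os.remove(path)  # tidy up
```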
Objects which can be used with with are said to support the context management protocol. Such objects can also be
defined by the programmer using dunder methods, see Python’s documentation106 for details.
The purpose of with is to make code more readable by avoiding too many try...except...finally blocks.
Up to now we considered syntax errors, which basically are typos in the code, and semantic errors, which are caused
by unexpected user input or failed file access. But code may contain more involved semantic errors, which may be
hard to identify. The process of finding and correcting semantic errors is known as debugging.
A simple approach to debugging is to print status information during program flow. For private scripts and a data
scientist’s everyday use this suffices. For higher quality programs the Python standard library provides the logging
package, which allows redirecting some of the status information to a log file. Logging basics are described in the
basic logging tutorial107 .
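A minimal sketch following the basic logging tutorial; here the messages are directed into a string buffer so they can be inspected (the logger name 'demo' is arbitrary):

```python
import io
import logging

buf = io.StringIO()
logger = logging.getLogger('demo')
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(buf))  # send messages to the buffer

logger.info('program started')
logger.debug('not shown, because the level is INFO')
print(buf.getvalue().strip())  # program started
```

In a real program one would use logging.basicConfig to direct messages to the console or to a log file instead.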
If looking at log messages does not suffice, there are programs specialized in debugging your code. We do not cover
this topic here, but if you are interested you should have a look at The Python Debugger108 and at Debugging with
Spyder109 .
106 https://docs.python.org/3/library/stdtypes.html#context-manager-types
107 https://docs.python.org/3/howto/logging.html#basic-logging-tutorial
108 https://docs.python.org/3/library/pdb.html
109 https://docs.spyder-ide.org/debugging.html
13.3 Profiling
Sometimes our code does what we want it to do, but it is too slow or consumes too much memory (out-of-memory
error from the operating system). Then it’s time for profiling.
You may use the Spyder Profiler110 or import profiling functionality from suitable Python packages.
The timeit module provides tools for measuring a Python script’s execution time in seconds.
import timeit

a = 1.23
code = """\
b = 4.56 * a
"""

print(timeit.timeit(code, number=1000000, globals=globals()))

0.06846186006441712
This code snippet packs some code into the string code and passes it to the timeit function. This function
executes the code number times to increase accuracy. The built-in function globals returns a dictionary of all
defined names. This dictionary should be passed to the timeit function to provide access to all names.
Have a look at the The Python Profilers111 , too.
Note: If working in Jupyter you may use the %timeit and %%timeit112 magics instead of the timeit module,
the former for timing one line of code (%timeit one_line_of_code), the latter for timing the whole code
cell (place it in the cell’s first line).
From data science view also memory consumption is of interest, because handling large data sets requires lots of
memory. There are many ways to obtain memory information. A simple one is as follows (install module pympler
first):
from pympler import asizeof

# my_string and my_int: some previously defined objects
print(asizeof.asizeof(my_string))
print(asizeof.asizeof(my_int))
72
32
This gives the size of the memory allocated for some object. This number also includes the size of ‘subobjects’, that
is, for example, all the objects referenced by a list object are included.
110 https://docs.spyder-ide.org/profiler.html
111 https://docs.python.org/3/library/profile.html
112 https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit
FOURTEEN
INHERITANCE
Inheritance is an important principle in object-oriented programming. Although we may reach our aims without using
inheritance in our own code, it’s important to know the concept and corresponding syntax constructs to understand
other people’s code. We’ll meet inheritance-related code
• in the documentation of modules and packages,
• when customizing and extending library code.
Related exercises: Object-Oriented Programming (page 267).
Inheritance is a technique to create new classes by extending and/or modifying existing ones. A new class may have
a base class. The new class inherits all methods and member variables from its base class and is allowed to replace
some of the methods and to add new ones. Syntax:
class NewClass(BaseClass):
The only difference compared to usual class definitions is in the first line, where a base class can be specified. Defining
methods works as before. If the method name does not exist in the base class, then a new method is created. If it
already exists in the base class, the new one is used instead of the base class’ method. In addition to explicitly defined
methods, the new class inherits all methods from the base class.
Inheritance saves implementation time and leads to a well-structured class hierarchy. Object-oriented program-
ming is not solely about defining classes (encapsulation and abstraction), but also about defining meaningful relations
between classes, thus, to some extent, mapping the real world to source code.
14.3 Example
Real-life examples of inheritance often are quite involved. For illustration we use a pathological example resembling
relations between geometric objects.
Imagine a vector drawing program. Each geometric object shall be represented as object of a corresponding class.
Say quadrangles are objects of type Quad, paraxial rectangles are objects of type ParRect and so on. Let’s start
with class Point:
class Point:
    ''' represent a geometric point in two dimensions '''
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __str__(self):
        return f'({self.x}, {self.y})'
class Quad:
    ''' represent a quadrangle '''
    def __init__(self, a, b, c, d):
        self._a = a
        self._b = b
        self._c = c
        self._d = d
    def get_points(self):
        return (self._a, self._b, self._c, self._d)
    def __str__(self):
        return f'quadrangle with points ' \
               f'({self._a.x}, {self._a.y}), ({self._b.x}, {self._b.y}), ' \
               f'({self._c.x}, {self._c.y}), ({self._d.x}, {self._d.y})'
The member variables _a, _b, _c, _d are hidden since we consider them implementation details. If the user wants
access to the four points making up the quadrangle, get_points should be called. This way we are free to store the
quadrangle in a different format if that seems reasonable in the future when extending the class’s functionality. This is
a design decision and is in no way related to inheritance.
Here comes ParRect: Note that a paraxial rectangle is defined by two Points.
class ParRect(Quad):
    ''' represent a paraxial rectangle '''
    def __init__(self, a, c):
        # build the two missing corners and call Quad's constructor via super()
        super().__init__(a, Point(c.x, a.y), c, Point(a.x, c.y))
    def __str__(self):
        return f'paraxial rect with points ({self._a.x}, {self._a.y}), ' \
               f'({self._c.x}, {self._c.y})'
    def area(self):
        ''' return the rect's area '''
        return abs(self._b.x - self._a.x) * abs(self._d.y - self._a.y)
The ParRect class inherits everything from Quad. It has a new constructor with fewer arguments than in Quad,
but calls the constructor of Quad.
Important: The built-in function super in principle returns self (that is, the current object), but redirects method
calls to the base class.
We reimplement __str__ and add the new method area. Note that ParRect objects have member variables
_a, _b, _c, _d since those are created by the Quad constructor we call in the ParRect constructor. Also the
get_points method is a member of ParRect since it gets inherited from Quad.
parrect = ParRect(Point(1, 1), Point(3, 2))  # two opposite corners (illustrative values)

print(parrect)
print('area: {}'.format(parrect.area()))

a, b, c, d = parrect.get_points()
print('all points:', a, b, c, d)
Note that isinstance also returns True if we check against a base class of an object’s class. In other words, each
object is an instance of its class and of all base classes.
print(isinstance(parrect, ParRect))
print(isinstance(parrect, Quad))
True
True
In contrast, type checking with type returns False if checked against the base class:
print(type(parrect) == ParRect)
print(type(parrect) == Quad)
True
False
In Python there is a built-in class object and every newly created class automatically becomes a subclass of ob-
ject. The line
class my_new_class:
is equivalent to
class my_new_class(object):
and also to
class my_new_class():
by the way.
To see this in code we might use the built-in function issubclass. This function returns True if the first argument
is a subclass of the second.
class MyClass:
def __init__(self):
print('Here is __init__()!')
print(issubclass(MyClass, object))
True
Alternatively, we may have a look at the __base__ member variable, which stores the base class:
print(MyClass.__base__)
<class 'object'>
Objects of type object do not have real functionality. The object class provides some auxiliary stuff used by
the Python interpreter for managing classes and objects.
obj = object()
dir(obj)
['__class__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__ne__',
 '__new__',
 ...]
Python does not allow for directly implementing so-called virtual methods. A virtual method is a method in a base class
which has to be (re-)implemented by each subclass. The typical situation is as follows: the base class implements
some functionality which for some reason has to call a method of a subclass. How can we guarantee that the subclass
provides the required method?
In Python a virtual method is a usual method which raises a NotImplementedError, a special exception type
like ZeroDivisionError and so on. If everything is correct, this never happens, because the subclass overrides
the base class’ method. But if the creator of the subclass forgets to implement the method required by the base class,
an error message will be shown.
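A minimal sketch of this pattern (the class and method names are made up):

```python
# a 'virtual' method realized via NotImplementedError
class Base:
    def describe(self):
        # base class functionality relying on a method the subclass must provide
        return 'value: {}'.format(self.value())
    def value(self):
        raise NotImplementedError('subclasses must implement value()')

class Concrete(Base):
    def value(self):
        return 42

print(Concrete().describe())  # value: 42
```

Calling Base().describe() directly would raise NotImplementedError, immediately pointing the subclass author to the missing method.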
A class may have several base classes. Just provide a tuple of base classes in the class definition:

class NewClass(BaseClass1, BaseClass2):
    # method definitions as usual

The new class inherits everything from all its base classes.
If two base classes provide methods with identical names, the Python interpreter has to decide which one to use for
the new class. There is a well-defined algorithm for this decision. If you need this knowledge someday, watch out for
method resolution order (MRO).
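A small sketch of the question the MRO answers (the class names are made up):

```python
# which 'who' does C inherit when both base classes define one?
class A:
    def who(self):
        return 'A'

class B:
    def who(self):
        return 'B'

class C(A, B):
    pass

print(C().who())  # A -- the leftmost base class wins in this simple case
```

The full resolution order can be inspected via C.__mro__.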
Up to now we used built-in exceptions only, like ZeroDivisionError. But now we have gathered enough
knowledge to define new exceptions. Exceptions are classes, as we noted before. Each exception is a direct or indirect
subclass of BaseException. Almost all exceptions are also a subclass of Exception, which itself is a direct
subclass of BaseException. See Exception hierarchy113 for the exceptions’ genealogy.
If we want to introduce a new exception, we have to create a new subclass of Exception.
class SomeError(Exception):
    def __init__(self, message):
        self.message = message

def my_function():
    print('I do something...')
    raise SomeError('Meaty error message!!!')

try:
    print('Entering my_function...')
    my_function()
except SomeError as e:
    print('Exception SomeError:', e.message)
113 https://docs.python.org/3/library/exceptions.html#exception-hierarchy
Entering my_function...
I do something...
Exception SomeError: Meaty error message!!!
At first we define a new exception class SomeError. The constructor takes an error message and stores it in the
member variable message. The function my_function raises SomeError. The main program catches this
exception and prints the error message. The as keyword provides access to a concrete SomeError object containing
the error message.
Note that except SomeBaseClass also catches all subclasses of SomeBaseClass. If we want to handle a
subclass exception separately, we have to place its except line above the base class’s except line. Conversely, an
except for a subclass never handles a base class exception.
FIFTEEN
Python provides many more features than we need. Here we list some of them which might be of interest, either
because they simplify some coding tasks or because they frequently occur in other people’s code.
Python does not allow empty functions, classes, loops and so on. But there is a ‘do nothing’ command, the pass114
keyword.
def do_nothing():
pass
do_nothing()
Especially for debugging purposes, placing multiple if statements can be avoided by using assert115 instead. The
condition following assert is evaluated. If the result is False, an AssertionError is raised, else nothing
happens. An optional error message is possible, too.
a = 0  # some number from somewhere (e.g., user input)

assert a != 0, 'Zero is not allowed!'

b = 1 / a
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 a = 0  # some number from somewhere (e.g., user input)
----> 3 assert a != 0, 'Zero is not allowed!'
      5 b = 1 / a
AssertionError: Zero is not allowed!
114 https://docs.python.org/3/reference/simple_stmts.html#the-pass-statement
115 https://docs.python.org/3/reference/simple_stmts.html#the-assert-statement
Python 3.10 introduced two new keywords: match and case. In the simplest case they can be used to replace
if...elif...elif...else constructs for discriminating between many cases. But in addition they provide a powerful
pattern matching mechanism. See PEP 636 - Structural Pattern Matching: Tutorial116 for an introduction.
Python knows a data type for representing (mathematical) sets. Its name is set. Typical operations like unions and
intersections of sets are supported. The official Python Tutorial117 shows basic usage.
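A small sketch of basic set operations (the values are illustrative):

```python
a = {1, 2, 3}
b = {3, 4}
print(a | b)   # union
print(a & b)   # intersection
print(2 in a)  # membership test
```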
Function decorators are syntactic sugar (that is, not really needed), but very common. They precede a function
definition and consist of an @ character and a function name. Function decorators are used to modify a function by
applying another function to it. The following two code cells are more or less equivalent:
def double(func):
return lambda x: 2 * func(x)
@double
def calc_something(x):
return x * x
def double(func):
return lambda x: 2 * func(x)
def calc_something(x):
return x * x
calc_something = double(calc_something)
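Calling the decorated function shows the effect; this sketch repeats the definitions from above to be self-contained:

```python
def double(func):
    # wrap func so that its result is doubled
    return lambda x: 2 * func(x)

@double
def calc_something(x):
    return x * x

print(calc_something(3))  # 18, i.e. 2 * (3 * 3)
```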
The copy module119 provides functions for shallow and deep copying of objects. For discussion of the copy problem
in the context of lists see Multiple Names and Copies (page 96).
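A small sketch contrasting shallow and deep copies (the data is illustrative):

```python
import copy

a = [[1, 2], [3, 4]]
shallow = copy.copy(a)      # copies the outer list only
deep = copy.deepcopy(a)     # copies inner lists, too

a[0][0] = 99
print(shallow[0][0])  # 99 -- inner lists are shared with a
print(deep[0][0])     # 1  -- fully independent copy
```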
116 https://peps.python.org/pep-0636/
117 https://docs.python.org/3/tutorial/datastructures.html#sets
118 https://docs.python.org/3/reference/compound_stmts.html#function-definitions
119 https://docs.python.org/3/library/copy.html
15.7 Multitasking
Reading large data files or downloading data from some server may take a while. During this time the CPU is
more or less idle and could do some heavy computations without slowing down the data transfer. This is a typical situation
where one wants two Python programs, or two parts of one and the same program, running in parallel, possibly
communicating with each other.
Python has the threading120 module for real multitasking on operating system level. The asyncio121 module in
combination with the async and await keywords provides a simpler, cooperative multitasking approach completely controlled
by the Python interpreter.
The Python ecosystem provides lots of packages for creating and controlling graphical user interfaces (GUIs). Here
are some widely used ones:
• tkinter122 in Python’s standard library provides support for classical desktop applications with a main win-
dow, subwindows, buttons, text fields, and so on.
• ipywidgets123 is a very easy to use package for creating graphical user interfaces in Jupyter notebooks. For
instance a slider widget could control some parameter of an algorithm.
• flask124 is a package for building web apps with Python, that is, the user interacts via a website with the
Python program running on a server.
120 https://docs.python.org/3/library/threading.html
121 https://docs.python.org/3/library/asyncio.html
122 https://docs.python.org/3/library/tkinter.html
123 https://ipywidgets.readthedocs.io
124 https://flask.palletsprojects.com
CHAPTER
SIXTEEN
Almost all data comes as tables with lots of numbers. NumPy125 is a Python package for efficiently handling large
tables of numbers. NumPy also provides advanced and very efficient linear algebra operations on such tables, for
instance solving systems of linear equations based on the data. Most machine learning algorithms boil down to
moving large amounts of data and doing some linear algebra. Thus, it’s a good idea to spend some time understanding
NumPy’s basic principles and discovering NumPy’s functionality.
• NumPy Arrays (page 155)
• Array Operations (page 161)
• Advanced Indexing (page 165)
• Vectorization (page 166)
• Array Manipulation Functions (page 168)
• Copies and Views (page 171)
• Efficiency Considerations (page 173)
• Special Floats (page 175)
• Linear Algebra Functions (page 177)
• Random Numbers (page 179)
Related exercises:
• NumPy Basics (page 269)
• Image Processing with NumPy (page 271)
import numpy as np
125 https://numpy.org
From mathematics we know Vectors (page 333) and Matrices (page 334). A vector is a (one-dimensional) list of
numbers. A matrix is a (two-dimensional) field of numbers. Vectors could be represented by lists in Python, whereas
a matrix would be a list of lists (a list of rows or a list of columns).
Using Python lists for representing large vectors and matrices is very inefficient. Each item of a Python list has its
own location somewhere in memory. When reading a whole list, to multiply a vector by some number, for instance,
Python reads the first list item, then looks for the memory location of the second, then reads the second, and so on.
A lot of memory management is involved.
To significantly improve performance, NumPy provides the ndarray data type. The most important property of
an ndarray is its dimension. A one-dimensional array stores a vector. A two-dimensional array stores a matrix.
Zero-dimensional arrays hold a single number and are valid Python objects, too. Visualization of arrays with dimension above
two is somewhat difficult. A three-dimensional array can be visualized as a cuboid of numbers, each number described
by three indices (row, column, depth level). We will meet dimensions of three and above almost every day when
diving into machine learning. One example are color images: two dimensions for pixel positions, one dimension for
color channels (red, green, blue, transparency).
Why are NumPy arrays more efficient?
• All items of a NumPy array have to have identical data type, mostly float or integer. This saves time and
memory for handling different types and type conversions.
• All items of a NumPy array are stored in a well-structured contiguous block of memory. To find the next
item or to copy a whole array or part of it much less memory management operations are required.
• NumPy provides optimized mathematical operations for vectors and matrices. Instead of processing arrays
item by item, NumPy functions take the whole array and process it in compiled C code. Thus, the item-by-item
part is not done by the (slow) Python interpreter, but by (very fast) compiled code.
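The speed difference can be made visible with a rough timing sketch (the array size is illustrative; absolute numbers depend on the machine):

```python
import time
import numpy as np

n = 1_000_000
data = list(range(n))
arr = np.arange(n)

t0 = time.perf_counter()
doubled_list = [2 * x for x in data]  # item-by-item in the interpreter
t1 = time.perf_counter()
doubled_arr = 2 * arr                 # one call into compiled code
t2 = time.perf_counter()

print('list comprehension: {:.4f} s'.format(t1 - t0))
print('NumPy:              {:.4f} s'.format(t2 - t1))
```

On typical machines the NumPy variant is one to two orders of magnitude faster.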
There are several ways to create NumPy arrays. We start with conversion of Python lists or tuples by NumPy’s
array function.
Passing a list or a tuple to array yields a one-dimensional ndarray. The data type is determined by NumPy to
be the simplest type which can hold all objects in the list or tuple.
a = np.array([23, 42, 7, 4, -2])

print(a)
print(a.dtype)

[23 42 7 4 -2]
int64
The member variable ndarray.dtype contains the array’s data type. Here NumPy decided to use int64, that
is, integers of length 8 bytes. Available types will be discussed below. An example with floats:

b = np.array([2.5, -7.0, 3.1, 0.5, 1.2])  # some floats (illustrative values)

print(b)
print(b.dtype)
Important: NumPy ships with its own data types for numbers to allow for more efficient storage and computations.
Python’s int type allows for arbitrarily large numbers, whereas NumPy has different types for integers with different
(and finite!) numerical ranges. NumPy also knows several types of floats differing in precision (number of decimal
places) and range. Wherever possible conversion between Python types and NumPy types is done automatically.
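The finite numerical ranges can be observed directly (a sketch; int8 is NumPy's smallest integer type):

```python
import numpy as np

# 8-bit integers cover -128..127 only; arithmetic wraps around
a = np.array([127], dtype=np.int8)
print(a + 1)  # [-128]
```

Such silent overflows are a common source of bugs when choosing small integer types to save memory.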
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(c)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
Next to explicit creation, NumPy arrays may be the result of mathematical operations:
d = a + b
print(type(d))
<class 'numpy.ndarray'>
To see that d is indeed a new ndarray and not an in-place modified a or b, we might look at the object ids, which
are all different:

print(id(a), id(b), id(d))
A third way for creating NumPy arrays is to call specific NumPy functions returning new arrays. From np.zeros
we get an array of zeros. From np.ones we get an array of ones. There are much more functions like zeros and
ones, see Array creation routines126 in Numpy’s documentation.
a = np.zeros(5)
b = np.ones((2, 3))
print(a, '\n')
print(b)
[0. 0. 0. 0. 0.]
[[1. 1. 1.]
[1. 1. 1.]]
126 https://numpy.org/doc/stable/reference/routines.array-creation.html#routines-array-creation
Objects of type ndarray have several member variables containing important information about the array:
• ndim: number of dimensions,
• shape: tuple of length ndim with array size in each dimension,
• size: total number of elements,
• nbytes: number of bytes occupied by the array elements,
• dtype: the array’s data type.
a = np.zeros((4, 3))
print(a.ndim)
print(a.shape)
print(a.size) # 4 * 3
print(a.nbytes) # 4 * 3 * 8
print(a.dtype)
2
(4, 3)
12
96
float64
It’s important to know that shape matters. In mathematics almost always we identify vectors with matrices having
only one column. But in NumPy these are two different things. A vector has shape (n, ), that is ndim is 1, whereas
a one-column matrix has shape (n, 1) with ndim of 2. Consequently, a vector neither is a row nor a column in
NumPy. It’s simply a list of numbers, nothing more.
a = np.zeros(5)
b = np.zeros((5, 1))
c = np.zeros((1, 5))
print(a, '\n')
print(b, '\n')
print(c)
[0. 0. 0. 0. 0.]
[[0.]
[0.]
[0.]
[0.]
[0.]]
[[0. 0. 0. 0. 0.]]
Elements of NumPy arrays can be accessed similarly to items of Python lists. That is, the first item in a one-
dimensional ndarray has index 0 and the last one has index ndarray.size - 1. Slicing is allowed, too.
a = np.array([23, 42, -7, 4, 10])
print(a[0], '\n')
print(a[1:3], '\n')
print(a[::2])
23
[42 -7]
[23 -7 10]
In case of multi-dimensional arrays we have to provide an index for each dimension. Slicing is done per dimension.
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(a, '\n')
print(a[1:, :2], '\n')
print(a[::2, ::2], '\n')
print(a[1, :])
[[1 2 3]
[4 5 6]
[7 8 9]]
[[4 5]
[7 8]]
[[1 3]
[7 9]]
[4 5 6]
Note: Selecting all elements in the last dimensions like in a[1, :] can be abbreviated to a[1]. The same holds
for higher dimensions: a[1, 3, :, :] is equivalent to a[1, 3]. The drawback is that the array's dimensionality
is not immediately visible from the indexing expression.
NumPy knows many different numerical data types. Often we do not have to care about types (NumPy will choose
suitable ones), but sometimes we have to specify data types explicitly (see examples below).
Almost all NumPy functions accept the keyword argument dtype to specify the data type of the function’s return
value. Either pass a string with the desired type’s name or pass a type object. Passing Python types like int makes
NumPy choose the most appropriate NumPy type (here, np.int64 or the string 'int64').
a = np.zeros((2, 3))
b = np.zeros((2, 3), dtype=np.int64)
print(a, '\n')
print(b)
[[0. 0. 0.]
[0. 0. 0.]]
[[0 0 0]
[0 0 0]]
Hint: The dtype member of ndarrays and the dtype argument to NumPy functions carry more information
than the bare type (e.g., ‘signed integer of length 64 bits’). They also contain information about how data is organized
in memory. This is important for efficient import of data from external sources. Details will be discussed in Saving
and Loading Non-Standard Data (page 181).
Working with large NumPy arrays we have to save memory wherever possible. One important ingredient for memory
efficiency is choosing small types, that is, types with small range. Often we work with arrays of zeros and ones or of
small integers only. Then we should choose the smallest integer type:
a = np.zeros(1000, dtype=np.int8)
print(a.dtype, a.nbytes)
int8 1000
For a data set with one billion numbers, choosing the right type makes the difference between requiring 1 GB or
8 GB of memory!
127 https://numpy.org/doc/stable/reference/arrays.scalars.html#built-in-scalar-types
Creating an array without explicitly providing a data type makes NumPy choose np.int64 or np.float64
depending on the presence of floats. This may lead to hard-to-find errors:
a = np.array([1, 4, 6, 7])    # no floats, so NumPy chooses int64
a[3] = 0.3                    # float is silently truncated
print(a)
[1 4 6 0]
Modifying values in integer arrays converts the new values to the array's data type, even if information is lost.
To avoid such errors, always specify the data type explicitly when working with floats!
import numpy as np
All Python operators can be applied to NumPy arrays, where all operations work elementwise.
For instance, we can easily add two vectors or two matrices by using the + operator.
a = np.array([1, 2, 3])
b = np.array([9, 8, 7])
c = a + b
print(c)
[10 10 10]
Important: Because all operations work elementwise, the * operator on two-dimensional arrays does NOT multiply
the two matrices in the mathematical sense, but simply yields the matrix of elementwise products. Mathematical
matrix multiplication will be discussed in Linear Algebra Functions (page 177).
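A minimal sketch contrasting the two operations (matrices chosen for illustration):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# elementwise product
print(A * B)    # [[ 5 12], [21 32]]

# mathematical matrix product
print(A @ B)    # [[19 22], [43 50]]
```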
NumPy reimplements almost all mathematical functions, like sine, cosine and so on. NumPy’s functions take arrays as
arguments and apply the mathematical functions elementwise. Have a look at Mathematical functions128 in NumPy’s
documentation for a list of available functions.
a = np.array([1, 2, 3])
b = np.sin(a)
print(b)
[0.84147098 0.90929743 0.14112001]
128 https://numpy.org/doc/stable/reference/routines.math.html
Hint: Functions min and amin are equivalent. The amin variant exists to avoid confusion and name conflicts with
Python’s built-in function min. Writing np.min is okay.
Important: Functions np.min and np.minimum do different things. With min we get the smallest value in an
array, whereas minimum yields the elementwise minimum of two equally sized arrays. With np.argmin we get
the index (not the value) of the minimal element of an array.
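A small sketch of the three functions with freely chosen values:

```python
import numpy as np

a = np.array([4, 1, 7])
b = np.array([2, 3, 5])

print(np.min(a))         # 1: smallest value in a
print(np.minimum(a, b))  # [2 1 5]: elementwise minimum of a and b
print(np.argmin(a))      # 1: index of the smallest value in a
```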
a = np.array([1, 2, 3])
b = np.array([-1, 3, 3])
print(a > b)
[ True False False]
Comparisons result in NumPy arrays of data type bool. The function np.any returns True if and only if at least
one item of the argument is True. The function np.all returns True if and only if all items of the argument are
True.
a = np.array([True, False, False])
print(np.any(a))
print(np.all(a))
True
False
Some of NumPy’s functions also are accessible as methods of ndarray objects. Examples are any and all:
print(a.any())
print(a.all())
True
False
Due to Python’s internal workings for processing logical expressions efficiently it’s not possible to redefine and
and friends via dunder methods. Thus, there’s no chance for NumPy to define it’s own variant of and. There is
a dunder method __and__, but that implements bitwise ‘and’ (Python operator &), which is something different
than logical ‘and’.
To combine several conditions you might use logical_and129 and friends. Using Python’s and (and friends)
results in an error, because Python tries to convert a NumPy array to a bool value and it’s not clear how to do this
(any or all?).
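A short sketch of combining conditions (array values chosen for illustration):

```python
import numpy as np

a = np.array([1, 4, 2, 7, 5])

# combine two conditions elementwise; Python's `and` would raise here
mask = np.logical_and(a > 1, a < 6)
print(mask)    # [False  True  True False  True]

# the & operator works too, but needs parentheses due to operator precedence
mask2 = (a > 1) & (a < 6)
print(mask2)
```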
129 https://numpy.org/doc/stable/reference/generated/numpy.logical_and.html
16.2.3 Broadcasting
If dimensions of the operands of a binary operation do not fit (short vector plus long vector, for instance) an exception
is raised. But in some cases NumPy uses a technique called broadcasting to make dimensions fit by cloning suitable
subarrays. Examples:
a = np.array([[1, 2, 3]]) # 1 x 3
b = np.ones((4, 3)) # 4 x 3
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]]
(4, 3)
a = np.array([[1, 2, 3]]) # 1 x 3
b = np.array([[1], [2], [3], [4]]) # 4 x 1
c = a + b
print(a, '\n')
print(b, '\n')
print(c, '\n')
print(c.shape)
[[1 2 3]]
[[1]
[2]
[3]
[4]]
[[2 3 4]
[3 4 5]
[4 5 6]
[5 6 7]]
(4, 3)
a = np.array([1, 2, 3])    # shape (3,)
b = np.ones((4, 3))        # 4 x 3
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[1 2 3]
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]
[[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]
[2. 3. 4.]]
(4, 3)
a = np.array([[1, 2, 3],
              [4, 5, 6]])    # 2 x 3
b = 7                        # scalar
c = a + b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]
[4 5 6]]
7
[[ 8 9 10]
[11 12 13]]
(2, 3)
On the other hand broadcasting allows for efficient column or row operations:
a = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 x 3
b = np.array([[0.5], [2.0]])   # 2 x 1
c = a * b
print(a, '\n')
print(b, '\n')
print(c)
print(c.shape)
[[1 2 3]
[4 5 6]]
[[0.5]
[2. ]]
[[ 0.5 1. 1.5]
[ 8. 10. 12. ]]
(2, 3)
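If shapes cannot be broadcast in either direction, NumPy raises a ValueError; a quick sketch:

```python
import numpy as np

a = np.zeros(3)
b = np.zeros(4)

try:
    a + b    # shapes (3,) and (4,) are incompatible
except ValueError as e:
    print('broadcasting failed:', e)
```

The general broadcasting rules are described in Broadcasting130 in NumPy's documentation.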
NumPy supports different indexing techniques for accessing subarrays. We already discussed list-like indexing. Now
we add boolean and integer indexing.
import numpy as np
If the index to an array is a boolean array of the same shape as the indexed array, then a one-dimensional array of all
items where the index is True is returned.
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
idx = np.array([[True, True, False],
                [False, True, True],
                [True, False, False]])
b = a[idx]
print(a, '\n')
print(idx, '\n')
print(b)
[[1 2 3]
[4 5 6]
[7 8 9]]
[[ True True False]
[False True True]
[ True False False]]
[1 2 5 6 7]
a = np.array([1, 4, 3, 5, 7, 6, 3, 2, 4, 5, 6, 7, 4, 1, 9])
b = a[a > 3]
print(b)
[4 5 7 6 4 5 6 7 4 9]
130 https://numpy.org/doc/stable/user/basics.broadcasting.html
Here b is an array containing all numbers of a which are greater than 3. The comparison a > 3 returns a boolean
array of the same shape as a (note that broadcasting is used to compare an array to a number). The resulting boolean
array then is used as index to a.
Given an array we may provide an array of indices. The result of the corresponding indexing operation is an array of
the same size as the index array, but with the items of the indexed array at corresponding positions.
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
idx = np.array([[0, 5], [2, 3]])
print(a[idx])
[[1 6]
[3 4]]
For indexing multi-dimensional arrays we need multiple index arrays (one per dimension).
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
idx0 = np.array([[0, 0], [2, 2]])    # row indices
idx1 = np.array([[0, 2], [1, 0]])    # column indices
print(a[idx0, idx1])
[[1 3]
[8 7]]
16.4 Vectorization
NumPy is very fast if operations are applied to whole arrays instead of element-by-element. Thus, we should try to
avoid iterating through array elements and processing single elements. This idea is known as vectorization.
import numpy as np
Imagine we have a vector of length 𝑛, where 𝑛 is even. We would like to interchange each number at an even index
with its successor. The result shall be stored in a new array.
Here is the code based on a loop:
def interchange_loop(a):
    result = np.empty_like(a)
    for k in range(a.size // 2):
        result[2 * k] = a[2 * k + 1]
        result[2 * k + 1] = a[2 * k]
    return result
print(interchange_loop(np.array([1, 2, 3, 4, 5, 6, 7, 8])))
[2 1 4 3 6 5 8 7]
And here the vectorized version:
def interchange_vectorized(a):
    result = np.empty_like(a)
    result[0::2] = a[1::2]
    result[1::2] = a[0::2]
    return result
print(interchange_vectorized(np.array([1, 2, 3, 4, 5, 6, 7, 8])))
[2 1 4 3 6 5 8 7]
%%timeit
interchange_loop(np.zeros(1000))
178 µs ± 5.66 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
interchange_vectorized(np.zeros(1000))
2.96 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The speed-up is a factor of about 60 for 𝑛 = 1000 and becomes even larger as 𝑛 grows.
Given two lists of numbers we want to have a two-dimensional array containing all products from these numbers.
Here is the code based on a loop:
def products_loop(x, y):
    result = np.empty((len(x), len(y)), dtype=np.int64)
    for i, p in enumerate(x):
        for j, q in enumerate(y):
            result[i, j] = p * q
    return result

print(products_loop(range(1000), range(1000)))
And here the vectorized version, which multiplies a column matrix by a row matrix via broadcasting:
def products_vectorized(x, y):
    a = np.array([x])    # 1 x n row matrix
    b = np.array([y])    # 1 x n row matrix
    return a.T * b       # (n x 1) * (1 x n) yields n x n

print(products_vectorized(range(1000), range(1000)))
[[ 0 0 0 ... 0 0 0]
[ 0 1 2 ... 997 998 999]
[ 0 2 4 ... 1994 1996 1998]
...
[ 0 997 1994 ... 994009 995006 996003]
[ 0 998 1996 ... 995006 996004 997002]
[ 0 999 1998 ... 996003 997002 998001]]
Hint: The T member variable of a NumPy array provides the transposed array. It's not a copy (expensive) but a
view (cheap). For details see Linear Algebra Functions (page 177).
Execution times:
%%timeit
products_loop(range(1000), range(1000))
248 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
products_vectorized(range(1000), range(1000))
1.25 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
16.4.3 Important
Whenever you see a loop in numerical routines, spend some time vectorizing it. Almost always this is possible.
Often vectorization also increases the readability of the code.
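A typical pattern is replacing an if inside a loop by a boolean mask; a small sketch with freely chosen values:

```python
import numpy as np

a = np.array([-2, 3, -1, 4])

# loop version: set negative entries to zero
b = a.copy()
for k in range(b.size):
    if b[k] < 0:
        b[k] = 0

# vectorized version using a boolean mask
c = a.copy()
c[c < 0] = 0

print(b, c)    # [0 3 0 4] [0 3 0 4]
```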
NumPy comes with lots of functions for manipulating arrays. Some of them are needed more often, others almost
never. A comprehensive list is provided in array manipulation routines131 . Here we only mention some of the more
important ones.
import numpy as np
131 https://numpy.org/doc/stable/reference/routines.array-manipulation.html
A NumPy array’s reshape132 method yields an array of different shape, but with identical data. The new array has
to have the same number of elements as the old one.
a = np.ones(5) # 1d (vector)
b = a.reshape(1, 5) # 2d (row matrix)
c = a.reshape(5, 1) # 2d (column matrix)
print(a, '\n')
print(b, '\n')
print(c)
[1. 1. 1. 1. 1.]
[[1. 1. 1. 1. 1.]]
[[1.]
[1.]
[1.]
[1.]
[1.]]
One dimension may be replaced by -1 indicating that the size of this dimension shall be computed by NumPy:
a = np.ones((8, 8))
b = a.reshape(4, -1)
print(a.shape, b.shape)
(8, 8) (4, 16)
To mirror a 2d array on its vertical or horizontal axis use fliplr133 and flipud134 , respectively.
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.fliplr(a)
print(a, '\n')
print(b)
print(b)
[[1 2 3]
[4 5 6]]
[[3 2 1]
[6 5 4]]
132 https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
133 https://numpy.org/doc/stable/reference/generated/numpy.fliplr.html
134 https://numpy.org/doc/stable/reference/generated/numpy.flipud.html
Arrays of identical shape (except for one axis) may be joined along an existing axis into one large array with
concatenate135 .
a = np.ones((2, 3))
b = np.zeros((2, 5))
c = np.full((2, 2), 5)
d = np.concatenate((a, b, c), axis=1)
print(d)
[[1. 1. 1. 0. 0. 0. 0. 0. 5. 5.]
[1. 1. 1. 0. 0. 0. 0. 0. 5. 5.]]
If identically shaped arrays shall be joined along a new axis, use stack136 .
a = np.ones(2)
b = np.zeros(2)
c = np.full(2, 5)
d = np.stack((a, b, c), axis=1)
print(d)
[[1. 0. 5.]
[1. 0. 5.]]
Like Python lists, NumPy arrays may be extended by appending further data. The append137 function takes the
original array and the new data and returns a new, extended array (the original array is not modified).
a = np.ones((3, 3))
b = np.append(a, [[1, 2, 3]], axis=0)
print(b)
[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]
[1. 2. 3.]]
135 https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
136 https://numpy.org/doc/stable/reference/generated/numpy.stack.html
137 https://numpy.org/doc/stable/reference/generated/numpy.append.html
NumPy arrays may be very large. Thus, having too many copies of one and the same array (or of subarrays) is expensive.
NumPy implements a mechanism to avoid unnecessary copies by sharing data between arrays. The
programmer has to keep track of which arrays share data and which arrays are independent of others.
import numpy as np
16.6.1 Views
A view of a NumPy array is a usual ndarray object, that shares data with another array. The other array is called
the base array of the view.
Views can be created with an array’s view method. The base object is accessible through a view’s base member
variable.
a = np.ones((100, 100))
b = a.view()
print('id of a:', id(a))
print('id of b:', id(b))
print('base of a:', a.base)
print('id of base of b:', id(b.base))
id of a: 139825110801392
id of b: 139825110801776
base of a: None
id of base of b: 139825110801392
The view method is rarely called directly (it might be used for type conversions without copying). More frequently,
views originate from calling shape manipulation functions like reshape or fliplr:
b = a.reshape(10, 1000)
c = np.fliplr(a)
print(a.shape)
print(b.shape, b.base is a)
print(c.shape, c.base is a)
(100, 100)
(10, 1000) True
(100, 100) True
Operations on views alter the base array's (and other views') data:
b[0, 0] = 5
print(a[0, 0], c[0, 99])
5.0 5.0
Important: Writing data to views modifies the base array! This is a common source of errors, which are very hard
to track down. Always keep track of which of your arrays are views!
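To find out whether two arrays share data, np.shares_memory can be used; a small sketch:

```python
import numpy as np

a = np.ones((4, 4))
b = a.reshape(2, 8)    # view of a
c = a.copy()           # independent copy

print(np.shares_memory(a, b))    # True
print(np.shares_memory(a, c))    # False
```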
Views may be smaller than the original array. Such views of subarrays originate from slicing operations:
a = np.ones((100, 100))
b = a[4:10, :]
c = a[5]
print(a.shape)
print(b.shape, b.base is a)
print(c.shape, c.base is a)
(100, 100)
(6, 100) True
(100,) True
b[1, 0] = 5
print(a[5, 0])
5.0
16.6.3 Copies
Truly independent arrays can be created with an array's copy method. A copy does not share data with the original
array; its base is None:
a = np.ones((100, 100))
b = a.copy()
b[0, 0] = 5
print(a[0, 0], b[0, 0])
print(b.base)
1.0 5.0
None
Hint: NumPy arrays are mutable objects. Thus, assigning a new name to an array or passing an array to a function
does not copy the array. Keeping this in mind is very important because functions you call in your code may alter
your arrays. Conversely, when writing functions other people might use, clearly indicate in the documentation
whether your function modifies arrays passed as arguments. If in doubt, use copy.
When working with large arrays the most expensive operation (longest execution time) is copying arrays. Thus, we
should avoid making copies of arrays. But there are some more things to consider when optimizing execution time
and memory consumption.
import numpy as np
Sometimes data arrives in chunks and we have to build a large array step by step. We could start with an empty
array and append each new chunk of data. If incoming chunks are single numbers, the code could look as follows:
%%timeit
a = np.array([], dtype=np.int64)
for k in range(100):
    a = np.append(a, k)
396 µs ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Each call to append creates a new (larger) array and copies the existing one into the new one. In the end we made
100 expensive copy operations.
If we know the final size of our array in advance, then we should create an array of final size before filling it with
data:
%%timeit
a = np.empty(100, dtype=np.int64)
for k in range(100):
    a[k] = k
9.12 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
If data arrives in chunks and we do not know the final array's size in advance, we should use a Python list for
temporarily storing data. Appending to a Python list is cheap, because existing list data won't be copied; each list
item has its own (more or less random) location in memory. When the data is complete, we create a NumPy array
of correct size and copy the list's items to the array.
%%timeit
a = []
for k in range(0, 100):
a.append(k)
b = np.array(a, dtype=np.int64)
11.5 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
37.3 µs ± 5.83 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
13.2 µs ± 344 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Working with large arrays we should free memory as soon as possible (use del). A non-obvious situation, where
memory can be freed, is when using small views of large arrays. Consider the following code:
a = np.ones((10000, 10000))    # large array
b = a[0]                       # view of the first row
del a
Here the large array remains in memory although we only need the first row. Because the view b is based on the array
object a, del a only removes the name a, but garbage collection cannot remove the array object. More efficient
code:
a = np.ones((10000, 10000))
b = a[0].copy()                # independent copy of the first row
del a
Here only the first (copied) row remains in memory. The original large array is removed from memory by
Python's garbage collection as soon as possible.
NumPy does not raise an exception if we use mathematical operations which lead to undefined results. Instead,
NumPy prints a warning and returns one of several special floating point numbers.
import numpy as np
16.8.1 Infinity
Results of some undefined operations can be interpreted as plus or minus infinity. NumPy represents infinity by the
special float np.inf:
a = np.array([1, -1])
b = a / 0
print(b)
print(type(b[0]))
[ inf -inf]
<class 'numpy.float64'>
Executing the division also triggers a RuntimeWarning (divide by zero). The special float np.inf can also be
used directly in computations:
a = np.inf
print(a)
inf
Also well defined operations may lead to infinity if the range of floats is exhausted:
print(np.array([1.23]) ** 10000)
[inf]
Arithmetic with np.inf follows the usual mathematical conventions where results are defined:
print(np.inf + 1)
print(5 * np.inf)
print(0 * np.inf)
print(np.inf - np.inf)
inf
inf
nan
nan
Results of undefined operations which cannot be interpreted as infinity are represented by the special float np.nan
(not a number).
a = np.array([-1, 1])
b = np.log(a)
print(b)
[nan 0.]
Computing the logarithm of a negative number also triggers a RuntimeWarning. Arithmetic involving np.nan
always yields np.nan:
print(np.nan + 1)
print(5 * np.nan)
print(0 * np.nan)
nan
nan
nan
Don’t use usual comparison operators for testing for special floats. They show strange (but well-defined) behavior:
print(np.nan == np.nan)
print(np.inf == np.inf)
False
True
Instead, call np.isnan138 or np.isposinf139 (for +∞) or np.isneginf140 (for −∞) or np.isinf141 (for
both).
Comparisons between finite numbers and np.inf are okay:
print(1000000 < np.inf)
True
138 https://numpy.org/doc/stable/reference/generated/numpy.isnan.html
139 https://numpy.org/doc/stable/reference/generated/numpy.isposinf.html
140 https://numpy.org/doc/stable/reference/generated/numpy.isneginf.html
141 https://numpy.org/doc/stable/reference/generated/numpy.isinf.html
Some NumPy functions come in two variants, which differ in their handling of np.nan. Look at amin142 and
nanmin143 , for instance.
NumPy has a submodule linalg for functions related to linear algebra, but the base numpy module also con-
tains some linear algebra functions. Linear algebra144 in NumPy’s documentation provides a comprehensive list of
functions. Here we only provide a few examples.
import numpy as np
For inner products use np.inner145 . For cross products use np.cross146 .
a = np.array([1, 2, 3])
b = np.array([1, 0, 2])
print(np.inner(a, b))
print(np.cross(a, b))
7
[ 4 1 -2]
The np.transpose147 function yields the transpose of a matrix. Alternatively, a NumPy array’s member variable
T holds the transpose, too. The transpose is a view (not a copy) of the original matrix.
A = np.array([[1, 2, 3],
[4, 5, 6]])
print(np.transpose(A))
print(A.T)
[[1 4]
[2 5]
[3 6]]
[[1 4]
[2 5]
[3 6]]
142 https://numpy.org/doc/stable/reference/generated/numpy.amin.html
143 https://numpy.org/doc/stable/reference/generated/numpy.nanmin.html
144 https://numpy.org/doc/stable/reference/routines.linalg.html
145 https://numpy.org/doc/stable/reference/generated/numpy.inner.html
146 https://numpy.org/doc/stable/reference/generated/numpy.cross.html
147 https://numpy.org/doc/stable/reference/generated/numpy.transpose.html
NumPy introduces the @ operator for matrix multiplication. It’s equivalent to calling np.matmul148 .
A = np.array([[1, 2, 3],
[4, 5, 6]])
B = np.array([[1, 0],
[2, 1],
[1, 1]])
print(A @ B)
print(np.matmul(A, B))
[[ 8 5]
[20 11]]
[[ 8 5]
[20 11]]
Determinants and inverses of square matrices can be computed with np.linalg.det149 and np.linalg.
inv150 , respectively.
A = np.array([[2, 0],
[1, 1]])
print(np.linalg.det(A))
print(np.linalg.inv(A))
2.0
[[ 0.5 0. ]
[-0.5 1. ]]
Systems of linear equations can be solved with np.linalg.solve151 :
A = np.array([[2, 0],
[1, 1]])
b = np.array([2, 3])
print(np.linalg.solve(A, b))
[1. 2.]
148 https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
149 https://numpy.org/doc/stable/reference/generated/numpy.linalg.det.html
150 https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
151 https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html
In data science contexts random numbers are important for simulating data and for selecting random subsets of data.
NumPy provides a submodule random for creating arrays of (pseudo-)random numbers.
import numpy as np
NumPy supports several different algorithms for generating random numbers. For our purposes the choice of
algorithm does not matter (for cryptographic applications it matters!). Luckily NumPy provides a default one.
We first have to create a random number generator object (or get the default one) and initialize it with a seed. The
seed determines the sequence of generated random numbers. Using a fixed seed is important if we need reproducible
results (when testing things, for instance).
rng = np.random.default_rng(0)
Random numbers may follow different distributions. NumPy provides many standard distributions, see Random
Generator152 in NumPy's documentation.
# uniformly distributed integers from 23 to 40
a = rng.integers(23, 41, (4, 10))
print(a)
[[23 35 34 24 40 27 27 26 29 26]
[29 38 31 40 31 28 37 38 39 39]
[23 32 28 27 27 38 38 27 30 37]
[25 34 31 40 37 27 38 38 27 32]]
# uniformly distributed floats from the interval [0, 1)
a = rng.random((4, 10))
print(a)
# permutation of an array
a = np.array([1, 2, 3, 4 ,5])
b = rng.permutation(a)
print(b)
[4 2 5 3 1]
152 https://numpy.org/doc/stable/reference/random/generator.html#simple-random-data
SEVENTEEN

SAVING AND LOADING NON-STANDARD DATA
There exist Python modules for almost all standard file formats. Readers and writers for several formats also are
included in larger packages like matplotlib, opencv, pandas. To share data with others always use some
standard file format (PNG or JPEG for images, CSV for tabulated data, and so on).
For storing temporary data like interim results NumPy and the pickle module from Python’s standard library
provide very convenient quick-and-dirty functions. Next to those functions, in this chapter we also discuss how to
read custom binary file formats.
Related projects:
• MNIST Character Recognition (page 311)
– The xMNIST Family of Data Sets (page 311)
– Load QMNIST (page 313)
NumPy provides functions for saving arrays to files and for loading arrays from files.
import numpy as np
The np.save153 function writes an array to a file:
a = np.array([1, 2, 3])
np.save('some_array.npy', a)
The np.load154 function reads an array from a file written with np.save:
a = np.load('some_array.npy')
print(a)
[1 2 3]
153 https://numpy.org/doc/stable/reference/generated/numpy.save.html
154 https://numpy.org/doc/stable/reference/generated/numpy.load.html
Data Science and Artificial Intelligence for Undergraduates
To save multiple arrays to one file use np.savez155 and provide each array as a keyword argument. The result is
the same as calling save for each array and packing the resulting files into an uncompressed (!) ZIP archive. File
names in the ZIP archive correspond to keyword argument names.
a = np.array([1, 2, 3])
b = np.array([4, 5])
np.savez('some_arrays.npz', a=a, b=b)
Use np.load156 to load multiple arrays written with savez. The returned object is dict-like, that is, it behaves
like a dictionary, but isn't of type dict. Conversion to dict works as expected.
data = np.load('some_arrays.npz')
a = data['a']
b = data['b']
print(a)
print(b)
[1 2 3]
[4 5]
The pickle module provides functions for pickling (saving) and unpickling (loading) almost arbitrary Python objects
to and from files, respectively. For details on what objects are picklable see documentation of the pickle module158 .
import pickle
There exist two interfaces: either use the functions dump and load or create a Pickler and an Unpickler
object. Here we only discuss the former variant. For the latter see pickle module159 in Python’s documentation.
17.2.1 Pickling
some_object = [1, 2, 3, 4]
another_object = 'I\'m a string.'

with open('some_objects.pickle', 'wb') as f:
    pickle.dump(some_object, f)
    pickle.dump(another_object, f)
17.2.2 Unpickling
Objects are read back in the order they were written:
with open('some_objects.pickle', 'rb') as f:
    some_object = pickle.load(f)
    another_object = pickle.load(f)

print(some_object)
print(another_object)
[1, 2, 3, 4]
I'm a string.
Unpickling objects from unknown sources is a security risk. See pickle’s documentation162 .
If you have many objects to pickle, create a list of all objects and pickle the list. The advantage is that for unpickling
you do not have to remember how many objects you pickled. Simply unpickle the list and look at its length.
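A small sketch of this pattern, using the in-memory counterparts dumps and loads (objects chosen for illustration):

```python
import pickle

# collect everything in one list, then pickle the list as a single object
objects = [[1, 2, 3, 4], "I'm a string.", {'answer': 42}]

data = pickle.dumps(objects)      # bytes; dump(objects, f) would write to a file
restored = pickle.loads(data)

print(len(restored))        # number of pickled objects
print(restored == objects)  # True
```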
Sometimes data comes in custom binary formats for which no library functions exist. To read data from binary files
we have to know how to interpret the data. Which bytes represent text? Which bytes represent numbers? And so on.
Without format specification binary files are almost useless.
To view binary files use a hex editor. A hex editor shows a file byte by byte, where each byte is shown as two
hexadecimal digits. If you do not have a hex editor installed, try wxHexEditor163 .
Most binary files are composed of strings, bit masks, integers, floats, and padding bytes. The hex editor shows
common interpretations of bytes at current cursor position.
161 https://docs.python.org/3/library/pickle.html#pickle.load
162 https://docs.python.org/3/library/pickle.html
163 https://www.wxhexeditor.org
Fig. 17.1: A hex editor shows file contents in hexadecimal notation and as ASCII characters (right column) together
with common interpretations (lower panel).
We already discussed decoding binary data to strings in the chapter on Text Files (page 114). The only question is
how to find the end of a string. This question should be answered in the format specification. Usually string data is
terminated by a byte with value 0.
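A small sketch of reading a 0-terminated string from a buffer of bytes (the data is invented for illustration):

```python
data = b'GeoTIFF\x00more binary data follows'

# find the terminating 0 byte and decode everything before it
end = data.index(0)
name = data[:end].decode('ascii')
print(name)    # GeoTIFF
```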
Bit masks are bytes in which each bit describes a truth value. To extract a bit from a byte all programming languages
provide bitwise operators. Here we interpret a byte as a sequence of 8 bits. The following bitwise operations can be used:
• a & b returns 1 at a bit position if and only if a and b are both 1 at this position (bitwise and).
• a | b returns 1 at a bit position if and only if at least one of a and b is 1 at this position (bitwise or)
• a ^ b returns 1 at a bit position if and only if exactly one of a and b is 1 at this position (bitwise exclusive or)
• ~a returns 1 at a bit position if and only if a is 0 at this position (bitwise not)
Python implements these bitwise operators for signed integers, which leads to somewhat unexpected results (but it's
the only way since Python has no unsigned integers). Thus, better use NumPy's unsigned types.
To read the third bit use & 0b00100000:
bit_mask = np.uint8(0b10111100)
third_bit = (bit_mask & 0b00100000) > 0
third_bit
True
To set the third bit to 1 (when writing binary files) use | 0b00100000.
bit_mask = bit_mask | 0b00100000
bin(bit_mask)
'0b10111100'
To set the third bit to 0 (when writing binary files) use & ~0b00100000.
bit_mask = bit_mask & ~np.uint8(0b00100000)
bin(bit_mask)
'0b10011100'
Integer values in a binary file may have different lengths, from 1 byte up to 8 bytes. Reading a 1-byte integer is
very simple: just read the byte. For 2-byte integers things become more involved. There is a first byte (closer to the
beginning of the file) and a second byte, and there is no universally accepted rule for converting two bytes to an
integer. Denoting the first byte by 𝑎 and the second by 𝑏 there are two possibilities:
• 𝑎 + 256 𝑏 (least significant byte first, little endian, Intel format)
• 256 𝑎 + 𝑏 (most significant byte first, big endian, Motorola format)
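For two example byte values the formulas give different integers; a quick sketch of the arithmetic:

```python
a, b = 1, 2    # first and second byte (values chosen for illustration)

print(a + 256 * b)    # 513, little endian interpretation
print(256 * a + b)    # 258, big endian interpretation
```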
If we have 4-byte integers, the problem persists. With bytes 𝑎, 𝑏, 𝑐, 𝑑 we have
• 𝑎 + 256 𝑏 + 256² 𝑐 + 256³ 𝑑 (little endian)
• 256³ 𝑎 + 256² 𝑏 + 256 𝑐 + 𝑑 (big endian)
Analogously for 8-byte integers.
NumPy provides the fromfile164 function to read integers and other numeric data from binary files. Next to
offset (starting position) and count (number of items to read) it has a dtype keyword argument. Usual Python
and NumPy types are allowed, but more detailed type control is possible by providing a string consisting of:
• '<' (little endian) or '>' (big endian) and
• 'i' (signed integer) or 'u' (unsigned integer) and
• length of item in bytes.
164 https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html
Reading unsigned 32-bit integers in little endian notation would require '<u4', for instance.
If data is already in memory, use frombuffer165 instead of fromfile. Reading the same four bytes with
different dtype strings illustrates the effect of type and byte order:
data = bytes([200, 3, 4, 5])
print(np.frombuffer(data, dtype='u1'))
print(np.frombuffer(data, dtype='i1'))
print(np.frombuffer(data, dtype='<u2'))
print(np.frombuffer(data, dtype='>u2'))
print(np.frombuffer(data, dtype='>u4'))
print(np.frombuffer(data, dtype='>i4'))
[200 3 4 5]
[-56 3 4 5]
[ 968 1284]
[51203 1029]
[3355640837]
[-939326459]
165 https://numpy.org/doc/stable/reference/generated/numpy.frombuffer.html
166 https://numpy.org/doc/1.19/user/basics.byteswapping.html
EIGHTEEN

PANDAS
The Pandas167 Python package makes NumPy’s efficient computing capabilities more accessible for data science
purposes and adds functionality for complex data types like timestamps, time periods and categories. Pandas provides
powerful data structures and functions for managing, transforming and analyzing data.
• Series (page 188)
• Data Frames (page 198)
• Advanced Indexing (page 207)
• Dates and Times (page 217)
• Categorical Data (page 223)
• Restructuring Data (page 227)
• Performance Issues (page 233)
Related exercises:
• Pandas Basics (page 273)
• Pandas Indexing (page 276)
• Advanced Pandas (page 278)
• Pandas Vectorization (page 281)
Related projects:
• Public Transport (page 317)
– Get Data and Set Up the Environment (page 317)
– Find Connections (page 322)
• Corona Deaths (page 325)
• Weather (page 305)
– Climate Change (page 309)
167 https://pandas.pydata.org/
18.1 Series
Pandas Series is one of two fundamental Pandas data types (the other is DataFrame). A Series object holds
one-dimensional data, like a list, but with more powerful indexing capabilities. Data is stored in an underlying one-
dimensional NumPy array. Thus, most operations are much more efficient than with lists.
import pandas as pd
A Series object can be created from a Python list or a dictionary, for instance. See Series constructor168 in Pandas’
documentation.
s = pd.Series([23, 45, 67, 78, 90])
s
0 23
1 45
2 67
3 78
4 90
dtype: int64
pd.Series({'a': 12, 'b': 23, 'c': 45, 'd': 67})
a 12
b 23
c 45
d 67
dtype: int64
A Series consists of an index (first column printed) and its data (second column printed). All data items have to
be of identical type. The length of a Series is provided by the size member variable (you may also use Python’s
built-in function len).
s.size
Data in a Series behaves like a one-dimensional ndarray, but Pandas’ indexing mechanisms make things different
from NumPy. Pandas implements automatic data alignment. That is, data items do not have fixed positions like in a
NumPy array. Instead, only the (possibly non-integer) index matters. Here is a first example:
168 https://pandas.pydata.org/docs/reference/api/pandas.Series.html
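Two series like the following reproduce the alignment behavior shown below (a sketch; values chosen to match the printed outputs):

```python
import pandas as pd

s1 = pd.Series([2, 4, 3, 6], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 5, 7, 9], index=['a', 'b', 'd', 'e'])
print(s1, '\n')
print(s2, '\n')
print(s1 + s2)  # items are matched by label, not by position
```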
a 2
b 4
c 3
d 6
dtype: int64
a 1
b 5
d 7
e 9
dtype: int64
a 3.0
b 9.0
c NaN
d 13.0
e NaN
dtype: float64
Both series have the labels a, b and d, so addition is defined for those items. But c and e appear in only one of the two series. For these labels addition fails and the result is NaN (not a number).
Important: Note that the data type now is float although every number is an integer. The reason is that the integer data type cannot represent the float NaN. Thus, Pandas has to change the data type of the result. We will come back to such NaN problems later on.
If we had used NumPy, then the result would be the sum of two vectors:
import numpy as np
a = np.array([2, 4, 3, 6])
b = np.array([1, 5, 7, 9])
a + b
array([ 3,  9, 10, 15])
Index and data are accessible via index and array members of Series objects:
print(s.index, '\n')
print(s.array, '\n')
print(type(s.index), '\n')
print(type(s.array))
<PandasArray>
[23, 45, 67, 78, 90]
Length: 5, dtype: int64
<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.arrays.numpy_.PandasArray'>
The index member is one of several index types. Index objects will be discussed later on. The array member is
an array type defined by Pandas. If we want to have a NumPy array, we should call to_numpy():
a = s.to_numpy()
print(a, '\n')
print(type(a))
[23 45 67 78 90]
<class 'numpy.ndarray'>
18.1.4 Indexing
Accessing single items or subsets of a series works more or less the same way as for lists or dictionaries or NumPy
arrays.
The flexibility of Pandas’ multiple-items indexing mechanisms sometimes leads to confusion and unexpected errors.
In addition, some features are not well documented and a transition to more predictable and more clearly structured
indexing behavior is in progress.
Overview
There exist four widely used indexing mechanisms (here s is some series):
• s[...]: Python style indexing
• s.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)
• s.loc[...] and s.iloc[...]: new Pandas style indexing
• s.at[...] and s.iat[...]: new Pandas style indexing for more efficient access to single items
Deprecated Indexing
Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing and label based indexing. Position based means that, like for NumPy arrays, we refer to an item by its position in the series. The first item has position 0. Thus, the series’ index object is completely ignored. Providing an item of the series’ index member as index is referred to as label based indexing.
Both [...] and ix[...] behave slightly differently when using slicing. A major problem is that sometimes it
is not clear whether positional or label based indexing shall be used. Consider a series with an index made of id
numbers, that is, integers:
123 3
45 4
542 7
2 19
dtype: int64
19
123 3
45 4
dtype: int64
Without knowing the exact mechanism behind [...], which in fact calls the series’ __getitem__ method, code becomes unreadable. The same is true for ix. The ix indexer has been removed from Pandas since version 1.0.0 (January 2020). Indexing with [...] is still available, but should be avoided, at least for series with integer labels.
Preferred indexing is via loc[...] and iloc[...], the first for label based indexing, the second for positional indexing. Positional indexing is also known as integer indexing, hence the i in iloc. Slicing and boolean indexing are supported (see below).
If only a single item shall be accessed, then loc[...] and iloc[...] might be too slow due to the implementation of complex features like slicing. For single item access one should use at[...] and iat[...], providing label based and positional indexing, respectively.
Positional Indexing
Positional indexing via iloc[...] or iat[...] works like for one-dimensional NumPy arrays.
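A sketch matching the outputs below (the position lists and the boolean mask are assumptions read off the results; the original code cells are not shown here):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s, '\n')
print(s.iloc[[1, 3]], '\n')                        # items at positions 1 and 3
print(s.iloc[[0, 2]], '\n')                        # items at positions 0 and 2
print(s.iloc[[True, False, False, True, True]])    # boolean mask
```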
a 1
b 2
c 3
d 4
e 5
dtype: int64
b 2
d 4
a 1
c 3
dtype: int64
a 1
d 4
e 5
dtype: int64
An important difference to NumPy indexing is that the result is again a series. That is, the index of the selected items is returned, too.
Label Based Indexing
Label based indexing via loc[...] or at[...] works like indexing a dictionary with its keys, but slicing is allowed, too.
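A sketch matching the outputs below (the label lists and the boolean mask are read off the results):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s, '\n')
print(s.loc['b':'d'], '\n')                       # slicing includes the stop label
print(s.loc[['d', 'a', 'c']], '\n')               # list of labels, any order
print(s.loc[[True, False, False, True, True]])    # boolean mask
```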
a 1
b 2
c 3
d 4
e 5
dtype: int64
b 2
c 3
d 4
dtype: int64
d 4
a 1
c 3
dtype: int64
a 1
d 4
e 5
dtype: int64
Important: Note that slicing with labels includes the stop item!
Different items with identical labels are allowed. In that case loc[...] returns all items with the specified label and at[...] returns an array of all values with the specified label.
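The series with duplicate labels used below can be constructed like this (a sketch; values read off the output):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'b', 'c'])
print(s)
```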
print(s.loc['b'], '\n')
print(s.at['b'])
a 1
b 2
b 3
c 4
dtype: int64
b 2
b 3
dtype: int64
b 2
b 3
dtype: int64
Indexing by Callables
Both loc[...] and iloc[...] accept a function as their argument. The function has to take a series as argument
and has to return something allowed for indexing (list of indices/labels, boolean array and so on).
Scenarios justifying indexing by callables are relatively complex.
As for NumPy arrays, indexing Pandas series may return a view of the series. That is, modifying the extracted subset
of items might modify the original series. If you really need a copy of the items, use the copy169 method of Series
objects.
A full list of member functions for Series objects170 is provided in Pandas’ documentation. Here we only list a few
of them.
169 https://pandas.pydata.org/docs/reference/api/pandas.Series.copy.html
170 https://pandas.pydata.org/docs/reference/series.html
If a series is read from a file we would like to get some basic information about the series.
With describe171 we get statistical information about a series. The function returns a Series object containing
the collected information.
First and last items are returned by head172 and tail173 , respectively. Both take an optional argument specifying
the number of items to return. Default is 5.
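The outputs below presumably stem from a series like the following (values read off the head and tail outputs):

```python
import pandas as pd

# ten integer values with the default integer index
s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])
```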
print(s.describe(), '\n')
print(s.head(), '\n')
print(s.tail(3))
count 10.000000
mean 3.200000
std 2.250926
min -2.000000
25% 2.250000
50% 3.500000
75% 4.750000
max 6.000000
dtype: float64
0 2
1 4
2 6
3 5
4 4
dtype: int64
7 3
8 2
9 5
dtype: int64
Note that we did not specify labels explicitly. Thus, the Series constructor uses item positions as labels.
Iterating over the values of a series works like for Python lists:
for i in s:
print(i)
2
4
6
5
4
3
-2
171 https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html
172 https://pandas.pydata.org/docs/reference/api/pandas.Series.head.html
173 https://pandas.pydata.org/docs/reference/api/pandas.Series.tail.html
0 2
1 4
2 6
3 5
4 4
5 3
6 -2
7 3
8 2
9 5
If positional indices are required in addition to labels, use an additional enumerate:
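A sketch combining the items174 method with enumerate (assuming the series from above):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 5, 4, 3, -2, 3, 2, 5])

# i is the position, label and value come from the series
for i, (label, value) in enumerate(s.items()):
    print(i, label, value)
```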
0 0 2
1 1 4
2 2 6
3 3 5
4 4 4
5 5 3
6 6 -2
7 7 3
8 8 2
9 9 5
Vectorized Operators
Like NumPy arrays, Pandas series implement most mathematical and comparison operators.
a = pd.Series([1, 2, 3, 4])
b = pd.Series([4, 0, 6, 3])
print(a * b, '\n')
print(a < b)
0 4
1 0
2 18
3 12
dtype: int64
0 True
1 False
174 https://pandas.pydata.org/docs/reference/api/pandas.Series.items.html
Hint: Remember that Pandas uses data alignment, that is, labels matter, positions are irrelevant.
Functions all175 and any176 for boolean series are available, too.
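A boolean series matching the outputs below might look like this (a sketch; the concrete values are an assumption):

```python
import pandas as pd

s = pd.Series([True, False, True])
print(s.all())   # False, because not all items are True
print(s.any())   # True, because at least one item is True
```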
print(s.all())
print(s.any())
False
True
With drop177 we can remove items from a series. Simply pass a list of labels to the function.
t = s.drop([3, 4, 5])
print(t)
0 2
1 4
2 6
3 5
4 4
5 3
6 -2
7 3
8 2
9 5
dtype: int64
0 2
1 4
2 6
6 -2
7 3
8 2
9 5
dtype: int64
175 https://pandas.pydata.org/docs/reference/api/pandas.Series.all.html
176 https://pandas.pydata.org/docs/reference/api/pandas.Series.any.html
177 https://pandas.pydata.org/docs/reference/api/pandas.Series.drop.html
178 https://pandas.pydata.org/docs/reference/api/pandas.concat.html
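Two series can be concatenated with Pandas’ concat178 function. The series used below can be created like this (a sketch; values read off the outputs):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([0, 5, 6, 7], index=['d', 'e', 'f', 'g'])
```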
c = pd.concat([a, b])
print(a, '\n')
print(b, '\n')
print(c)
a 1
b 2
c 3
d 4
dtype: int64
d 0
e 5
f 6
g 7
dtype: int64
a 1
b 2
c 3
d 4
d 0
e 5
f 6
g 7
dtype: int64
Note that there is no check for duplicate index labels, since duplicate labels are allowed (see above).
18.2 Data Frames
A Pandas data frame (a DataFrame object) is a collection of Pandas series with a common index. Each series can be interpreted as a column in a two-dimensional table. There is a second index object for indexing columns.
Data frames are the most important data structure provided by Pandas.
import pandas as pd
A DataFrame object can be created from lists/dictionaries of lists/dictionaries or from NumPy arrays or from
lists/dictionaries of series, for instance. See DataFrame constructor183 in Pandas’ documentation. If necessary row
labels and column labels can be provided via index and columns keyword arguments, respectively.
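The data frame shown below can be created from a dictionary of dictionaries, for example (a sketch matching the displayed table):

```python
import pandas as pd

# outer keys become columns, inner keys become row labels;
# missing cells are filled with NaN
df = pd.DataFrame({'left': {'a': 1, 'b': 2},
                   'right': {'b': 3, 'c': 4}})
```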
df
left right
a 1.0 NaN
b 2.0 3.0
c NaN 4.0
df
Hint: JupyterLab shows data frames as a graphical table. In other words, Jupyter’s display function yields output different from print. With print we get a text representation of the data frame.
The shape member contains a tuple with the number of rows and columns of the data frame:
df.shape
(3, 3)
183 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Like series, data frames implement data alignment where possible. This applies to row indexing as well as to column indexing.
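The two data frames added below can be set up like this (a sketch; values and labels read off the displayed tables):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                   index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.arange(11, 20).reshape(3, 3),
                   index=['b', 'c', 'd'], columns=['A', 'C', 'D'])
```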
display(df1)
display(df2)
display(df1 + df2)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A C D
b 11 12 13
c 14 15 16
d 17 18 19
A B C D
a NaN NaN NaN NaN
b 15.0 NaN 18.0 NaN
c 21.0 NaN 24.0 NaN
d NaN NaN NaN NaN
The index object for row indexing is accessible via index member and the index object for column indexing is
accessible via columns member.
Data frames also have a to_numpy184 method returning a data frame’s data as NumPy array.
18.2.4 Indexing
Indexing data frames is very similar to indexing series. The major difference is that for data frames we have to specify
a row and a column instead of only a row.
Overview
There exist four widely used mechanisms (here df is some data frame):
• df[...]: Python style indexing
• df.ix[...]: old Pandas style indexing (removed from Pandas in January 2020)
• df.loc[...] and df.iloc[...]: new Pandas style indexing
• df.at[...] and df.iat[...]: new Pandas style indexing for more efficient access to single items
184 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
Python style indexing and old Pandas style indexing (the ix indexer) allow for position based indexing, label based indexing and mixed indexing (position for row, label for column, and vice versa). Both [...] and ix[...] behave slightly differently. As already discussed for series, sometimes it is not clear whether positional or label based indexing shall be used. Thus, [...] should be used with care. The ix indexer has been removed from Pandas in January 2020, but may appear in old code and documents.
Indexing with [...] is mainly used for selecting columns by label. For other purposes new Pandas style indexing
should be used.
s = df['A']
df_part = df[['B', 'C']]
display(s)
display(df_part)
A B C
a 1 2 3
b 4 5 6
c 7 8 9
a 1
b 4
c 7
Name: A, dtype: int64
B C
a 2 3
b 5 6
c 8 9
If only one label is provided, then the result is a Series object. If a list of labels is provided, then the result is a
DataFrame object. In case of integer labels, label based indexing is used when selecting columns:
2 23 45
a 1 2 3
b 4 5 6
c 7 8 9
a 1
b 4
c 7
Name: 2, dtype: int64
Hint: Label based column indexing can be used to create new columns: df['my_new_column'] = 0 creates a new column named my_new_column filled with zeros.
Positional Indexing
Positional indexing via iloc[...] or iat[...] works like for two-dimensional NumPy arrays.
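A sketch matching the outputs below (the position lists and slices are assumptions read off the results):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['a', 'b', 'c'], columns=['A', 'B', 'C'])
print(df, '\n')
print(df.iloc[1:, 0:2], '\n')          # row and column slices
print(df.iloc[[1, 0], ::-1], '\n')     # position list and reversed column slice
print(df.iloc[[0, 2], [1, 2]])         # row and column position lists
```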
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
b 4 5
c 7 8
C B A
b 6 5 4
a 3 2 1
B C
a 2 3
c 8 9
Variants (slicing, boolean and so on) may be mixed for rows and columns.
Label Based Indexing
Label based indexing via loc[...] or at[...] works like indexing a dictionary with its keys, but slicing is allowed, too.
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
c 7 8
b 4 5
B A
c 8 7
a 2 1
B C
a 2 3
c 8 9
Important: Like for series label based slicing includes the stop label!
Indexing by Callables
Both loc[...] and iloc[...] accept a function for row index and column index. The function has to take a
data frame as argument and has to return something allowed for indexing (list of indices/labels, boolean array and so
on).
Indexing with [...] allows for mixing positional and label based indexing. But [...] should be avoided as
discussed for series. The only exception is column selection. With Pandas’ new indexing style mixed indexing is still
possible but requires some more bytes of code:
display(df.loc['a':'b', df.columns[0:2]])
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A B
a 1 2
b 4 5
The idea is to use loc[...] and the columns member variable. With columns we get access to the index object for column indexing. This index object allows for usual positional indexing techniques (slicing, boolean and so on). The result of indexing an index object is again an index object, but it only contains the desired subset of indices. This smaller index object then is passed to loc[...]. The same is possible for rows via the index member.
Like for NumPy arrays indexing Pandas data frames may return a view of the data frame. That is, modifying the
extracted subset of items might modify the original data frame. If you really need a copy of the items, use the copy185
method of DataFrame objects.
A full list of member functions for DataFrame objects186 is provided in Pandas’ documentation. Here we only list
a few.
With describe187 we get basic statistical information about each column holding numerical values. The function returns a DataFrame object containing the collected information. Only columns with numerical data are considered by describe.
First and last rows are returned by head188 and tail189 . Both take an optional argument specifying the number of
rows to return. Default is 5.
The info190 method prints memory usage and other useful information.
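The data frame used below can be defined as follows (a sketch; values read off the outputs):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9],
                   'D': ['some', 'string', 'here']},
                  index=['a', 'b', 'c'])
```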
display(df.describe())
display(df.head(2))
display(df.tail(2))
df.info()
A B C D
a 1 2 3 some
b 4 5 6 string
c 7 8 9 here
A B C
count 3.0 3.0 3.0
mean 4.0 5.0 6.0
std 3.0 3.0 3.0
min 1.0 2.0 3.0
25% 2.5 3.5 4.5
50% 4.0 5.0 6.0
75% 5.5 6.5 7.5
max 7.0 8.0 9.0
185 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
186 https://pandas.pydata.org/docs/reference/dataframe.html
187 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
188 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
189 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html
190 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
A B C D
a 1 2 3 some
b 4 5 6 string
A B C D
b 4 5 6 string
c 7 8 9 here
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null int64
2 C 3 non-null int64
3 D 3 non-null object
dtypes: int64(3), object(1)
memory usage: 120.0+ bytes
To iterate over columns use items191 , which returns tuples containing the column label and the column’s data as
Series object.
A B C
a 1 2 3
b 4 5 6
c 7 8 9
A
a 1
b 4
c 7
Name: A, dtype: int64
B
a 2
b 5
c 8
Name: B, dtype: int64
C
a 3
b 6
c 9
Name: C, dtype: int64
191 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html
Iteration over rows can be implemented via the iterrows192 method. Analogously to items it returns tuples containing the row label and a Series object with the data.
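A sketch matching the outputs below (assuming the data frame with columns A, B, C from above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
                  index=['a', 'b', 'c'])

for label, row in df.iterrows():
    print(label)
    print(row, '\n')
```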
a
A 1
B 2
C 3
Name: a, dtype: int64
b
A 4
B 5
C 6
Name: b, dtype: int64
c
A 7
B 8
C 9
Name: c, dtype: int64
Hint: Data in each column of a data frame has identical type, but types in a row may differ (from column to column).
Thus, calling iterrows may involve type casting to get row data as Series object.
Important: Usually there is no need to iterate over rows, because Pandas provides much faster vectorized code for
almost all operations needed for typical data science projects.
192 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html
Vectorized Operators
The Pandas function read_csv199 reads a CSV file and returns a data frame.
The DataFrame method to_csv200 writes data to a CSV file.
Missing Values
Missing values are a common problem in data science. Pandas provides several functions and mechanisms for handling
missing values. Important functions:
• isna201 (return boolean data frame with True at missing value positions),
• notna202 (return boolean data frame with False at missing value positions),
• fillna203 (fill all missing values, different fill methods are provided),
• dropna204 (remove rows or columns containing missing values or consisting completely of missing values).
Details may be found in Pandas user guide205 .
193 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.concat.html
194 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
195 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html
196 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.combine.html
197 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
198 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
199 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
200 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
201 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html
202 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notna.html
203 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
204 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
205 https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
18.3 Advanced Indexing
One of Pandas’ most useful features is its powerful indexing mechanism. Here we’ll discuss several types of index objects.
import pandas as pd
Series and data frames have one or two index objects, respectively. They are accessible via Series.index or
DataFrame.index and DataFrame.columns. An index object is a list-like object holding all row or column
labels.
To create a new index, call the constructor of the Index class:
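A minimal sketch (the original example is not shown in this text version):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])
print(index)
```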
18.3.1 Reindexing
An index object may be replaced by another one. We have to take care whether data alignment between old and new
index shall be applied or not.
Series and DataFrame objects provide the reindex206 method. This method takes an index object (or a list of labels) and replaces the existing index by the new one. Data alignment is applied; that is, rows/columns with a label in the intersection of old and new index remain unchanged, but rows/columns with an old label not in the new index are dropped. If there are labels in the new index which aren’t in the old one, then rows/columns are filled with NaN, with some specified value, or by some more complex filling logic.
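A sketch reproducing the three output blocks below (fill values and the fill method are assumptions read off the results):

```python
import pandas as pd

s = pd.Series([123, 456, 789], index=['a', 'b', 'e'])
print(s, '\n')
print(s.reindex(['a', 'b', 'c', 'd', 'e']), '\n')                # fill with NaN
print(s.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0), '\n')  # fill with 0
print(s.reindex(['a', 'b', 'c', 'd', 'e'], method='bfill'))      # backward fill
```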
a 123
b 456
e 789
dtype: int64
a 123.0
b 456.0
c NaN
d NaN
e 789.0
dtype: float64
206 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 0
d 0
e 789
dtype: int64
a 123
b 456
e 789
dtype: int64
a 123
b 456
c 789
d 789
e 789
dtype: int64
The align207 method reindexes two series/data frames such that both have the same index.
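The series aligned below can be set up like this (a sketch; values read off the outputs):

```python
import pandas as pd

s1 = pd.Series([123, 456, 789], index=['a', 'b', 'e'])
s2 = pd.Series([98, 76, 54], index=['a', 'c', 'e'])

# align returns reindexed copies sharing a common index
s1_aligned, s2_aligned = s1.align(s2)
```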
a 123
b 456
e 789
dtype: int64
207 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.align.html
a 98
c 76
e 54
dtype: int64
a 123.0
b 456.0
c NaN
e 789.0
dtype: float64
a 98.0
b NaN
c 76.0
e 54.0
dtype: float64
To simply replace an index without data alignment, that is to rename all the labels, there are two variants:
• replace the index object by a new one of same length via usual assignment,
• use an existing column as index.
a 123
b 456
e 789
dtype: int64
aa 123
bb 456
cc 789
dtype: int64
To use a column of a data frame as index call the set_index208 method and provide the column label.
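The data frame used below may be defined like this (a sketch matching the outputs):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]},
                  index=['a', 'b', 'c'])
```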
df = df.set_index('A')
df
A B C
a 1 2 3
b 4 5 6
c 7 8 9
208 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
B C
A
1 2 3
4 5 6
7 8 9
To convert the index to a usual column call reset_index209 . The index will be replaced by the standard index
(integers starting at 0).
df = df.reset_index()
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9
Index objects may be shared between several series or data frames. Simply pass the index of an existing series or data
frame to the constructor of a new series or data frame or assign it directly or use reindex.
s1.index is s2.index
True
Note: Reindexing two series/data frames with align (see above) results in a shared index.
To append data to a series or data frame we may use label based indexing:
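The series modified below can be created like this (a sketch; values read off the outputs):

```python
import pandas as pd

s = pd.Series([123, 456], index=['a', 'b'])
print(s)
```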
s.loc['e'] = 789
s
a 123
b 456
dtype: int64
a 123
b 456
e 789
dtype: int64
209 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
df.loc['c', 'D'] = 5
df
A B
a 1 2
b 3 4
A B D
a 1.0 2.0 NaN
b 3.0 4.0 NaN
c NaN NaN 5.0
Pandas’ standard index is of type RangeIndex210 . It’s used whenever a series or a data frame is created without
specifying an index.
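The index iterated over below can be created like this (start, stop, and step are read off the printed values 5, 7, …, 19):

```python
import pandas as pd

index = pd.RangeIndex(5, 20, 2)   # start, stop (exclusive), step
```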
for k in index:
print(k)
5
7
9
11
13
15
17
19
The IntervalIndex211 class allows for imprecise indexing. Each item in a series or data frame can be accessed
by any number in a specified interval.
210 https://pandas.pydata.org/docs/reference/api/pandas.RangeIndex.html
211 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.html
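The series below can be set up as follows (a sketch; the interval bounds are read off the output — note that the intervals [6, 7) and [6.5, 9) overlap on purpose):

```python
import pandas as pd

index = pd.IntervalIndex.from_arrays([2, 6, 6.5], [3, 7, 9], closed='left')
s = pd.Series([23, 45, 67], index=index)
print(s)
```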
[2.0, 3.0) 23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
print(s.loc[2], '\n')
# print(s.loc[3], '\n') # KeyError
print(s.loc[2.5], '\n')
print(s.loc[6.7])
23
23
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Indexing by intervals:
23
IntervalIndex objects provide overlaps212 and contains213 methods for more flexible indexing:
[2.0, 3.0) 23
[6.0, 7.0) 45
dtype: int64
mask = s.index.contains(6.7)
print(mask)
s.loc[mask]
[6.0, 7.0) 45
[6.5, 9.0) 67
dtype: int64
Note: In principle contains should work with intervals instead of concrete numbers, too. But in Pandas 1.5.1
NotImplementedError is raised.
212 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.overlaps.html
213 https://pandas.pydata.org/docs/reference/api/pandas.IntervalIndex.contains.html
Up to some details, multi-level indexing is indexing using tuples as labels. Corresponding index objects are of type MultiIndex214. The major application of multi-level indices is representing several dimensions. Thus, high dimensional data can be stored in a two-dimensional data frame.
Let’s start with a two-level index. The first level contains courses of study provided by a university. The second level contains some lecture series. The data is the number of students from each course attending a lecture and the average rating for each lecture.
Using the MultiIndex constructor is the most general, but not a very straightforward way to create a multi-level index. We have to provide lists of labels for each level. In addition, we need lists of codes for each level indicating which label to use at each position. Each level may have a name.
courses_codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]
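The codes above can be combined with the level labels like this (a sketch repeating the code lines; level labels and data values are read off the displayed table):

```python
import pandas as pd

courses_codes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
lectures_codes = [0, 1, 2, 0, 1, 2, 0, 1, 2]

index = pd.MultiIndex(levels=[['Mathematics', 'Physics', 'Philosophie'],
                              ['Computer Science', 'Mathematics', 'Epistemology']],
                      codes=[courses_codes, lectures_codes],
                      names=['course', 'lecture'])

df = pd.DataFrame({'students': [10, 15, 8, 20, 17, 3, 2, 1, 89],
                   'rating': [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1]},
                  index=index)
```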
df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1
Hint: The mentioned creation methods are static methods. See Types (page 80) for some explanation of the
concept.
The above index contains each combination of items from two lists. Thus from_product is applicable:
214 https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html
215 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_arrays.html
216 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_tuples.html
217 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html
218 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
df
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Epistemology 8 3.6
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Epistemology 89 1.1
df.index.levels
Note that multi-level indexing is not restricted to row indexing. Multi-level column indexing works in exactly the same manner.
Accessing Data
Accessing data works as for other types of indices. Labels now are tuples containing one item per level. But there
exist additional techniques specific to multi-level indices.
Single Tuples
students 20.0
rating 3.0
Name: (Physics, Computer Science), dtype: float64
df.iloc[1, 1]
1.3
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics Epistemology 8 3.6
A new feature specific to multi-level indexing is slicing inside tuples. We would expect notation like
('Physics', :)
to get all rows with Physics at first level. But usual slicing syntax is not available here. Instead we have to use the
built-in slice function. It takes start, stop and step values (start and step default to None) and returns a slice
object. More precisely, the slice function is the constructor for slice objects. A slice object simply holds three
values (start, stop, step).
df.loc[('Physics', slice(None)), :]
students rating
course lecture
Physics Computer Science 20 3.0
Mathematics 17 1.6
Epistemology 3 4.7
df = df.sort_index()
df.loc[(slice('Mathematics', 'Physics'), 'Epistemology'), :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7
Note that label based slicing above requires a sorted index. Thus, we have to call sort_index219 first.
An alternative to slice is creating a pd.IndexSlice object, which allows for natural slicing syntax:
df.loc[pd.IndexSlice['Mathematics':'Physics', 'Epistemology'], :]
students rating
course lecture
Mathematics Epistemology 8 3.6
Philosophie Epistemology 89 1.1
Physics Epistemology 3 4.7
219 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html
The get_level_values220 method of an index object takes a level name or level index as argument and returns
a simple Index object containing only the labels at the specified level. This object can then be used to create a
boolean array for row indexing.
import numpy as np
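A sketch of the mask construction (the data frame is reconstructed here as in the examples above; the two conditions are assumptions read off the result below):

```python
import numpy as np
import pandas as pd

courses = ['Mathematics', 'Physics', 'Philosophie']
lectures = ['Computer Science', 'Mathematics', 'Epistemology']
index = pd.MultiIndex.from_product([courses, lectures], names=['course', 'lecture'])
df = pd.DataFrame({'students': [10, 15, 8, 20, 17, 3, 2, 1, 89],
                   'rating': [2.1, 1.3, 3.6, 3.0, 1.6, 4.7, 3.9, 4.9, 1.1]},
                  index=index)

# keep rows whose course is not Physics and whose lecture is not Epistemology
mask = np.logical_and(df.index.get_level_values('course') != 'Physics',
                      df.index.get_level_values('lecture') != 'Epistemology')
```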
df.loc[mask, :]
students rating
course lecture
Mathematics Computer Science 10 2.1
Mathematics 15 1.3
Philosophie Computer Science 2 3.9
Mathematics 1 4.9
Comparing an index object with a single value results in a one-dimensional boolean NumPy array with the same length as the index object. NumPy’s logical_and function implements elementwise logical AND.
Cross-Sections
There’s a shorthand for selecting all rows with a given label at a given level: xs221 . This method takes a label and a
level and returns corresponding rows as data frame.
df.xs('Mathematics', level='lecture')
students rating
course
Mathematics 15 1.3
Philosophie 1 4.9
Physics 17 1.6
Many Pandas functions accept a level keyword argument (like xs above or drop222 ) to provide functionality
adapted to multi-level indices.
220 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.get_level_values.html
221 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.xs.html
222 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
There are many different styles for using multi-level indexing. Some of them are very confusing for beginners, because the same syntax may have different semantics depending on the objects (row/column label, tuple, list) passed as arguments. Here we only considered safe and syntactically clear variants. To get an idea of other indexing styles have a look at MultiIndex / advanced indexing223 in the Pandas user guide.
18.4 Dates and Times
We already met the datetime module in Web Access (page 122) for handling points in time and time durations. Pandas extends those capabilities by introducing time periods (durations associated with a point in time) and more advanced calendar arithmetic.
Pandas also provides date and time related index objects to easily index time series data: DatetimeIndex,
TimedeltaIndex, PeriodIndex.
import pandas as pd
The basic data structure for representing points in time is the Timestamp224 object. It provides lots of useful methods for conversion from and to other date and time formats.
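The output below can be produced like this:

```python
import pandas as pd

pd.Timestamp('2020-02-15 12:34')
```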
Timestamp('2020-02-15 12:34:00')
The basic data structure for representing durations is the Timedelta225 object. Timedelta objects can be used on their own or to shift time stamps.
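A small sketch of shifting a time stamp by a duration (the concrete values are assumptions):

```python
import pandas as pd

ts = pd.Timestamp('2020-02-15 12:34')
delta = pd.Timedelta('2 days 3 hours')

print(ts + delta)   # shift forward in time
print(ts - delta)   # shift backward in time
```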
18.4.3 Periods
A period in Pandas is a time interval paired with a time stamp. Interpretation is as follows:
• The interval is one of several preset intervals, like a calendar month or a week from Monday till Sunday. See
Offset aliases226 and Anchored aliases227 for available intervals.
• The time stamp selects a concrete interval, the month or the week containing the time stamp, for instance.
The basic data structure for representing periods are Period228 objects.
223 https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
224 https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html
225 https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html
226 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
227 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#anchored-offsets
228 https://pandas.pydata.org/docs/reference/api/pandas.Period.html
2010-01-01 00:00:00
2010-12-31 23:59:59.999999999
2009-11-01 00:00:00
2010-10-31 23:59:59.999999999
Using time stamps for indexing offers lots of nice features, because Pandas originally was developed for handling time series.
The constructor for DatetimeIndex229 objects takes a list of time stamps and an optional frequency. Frequency
has to match the passed time stamps. If there is no common frequency in the data, the frequency is None (default).
A more convenient method is pd.date_range230 :
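For example, a daily index of ten days (as used in the indexing examples below) can be created like this (a sketch):

```python
import pandas as pd

index = pd.date_range('2018-03-14', periods=10, freq='D')
```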
df = pd.DataFrame({'year': [2018, 2018, 2020, 2020],
                   'month': [4, 6, 9, 9],
                   'day': [1, 2, 3, 4]})
display(df)
pd.to_datetime(df)
229 https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
230 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
0 2018-04-01
1 2018-06-02
2 2020-09-03
3 2020-09-04
dtype: datetime64[ns]
The to_datetime231 function expects the columns to be named 'day', 'month', and 'year'. It returns a series
of type datetime64, which may be converted to a DatetimeIndex.
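The data frame df itself is not shown above; judging from the output it may have looked like this:

```python
import pandas as pd

# columns 'year', 'month', 'day' as expected by pd.to_datetime
df = pd.DataFrame({'year': [2018, 2018, 2020, 2020],
                   'month': [4, 6, 9, 9],
                   'day': [1, 2, 3, 4]})

s = pd.to_datetime(df)        # series of dtype datetime64[ns]
idx = pd.DatetimeIndex(s)     # usable as index for other series or data frames
```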
Indexing
Exact Indexing
Using Timestamp objects for label-based indexing yields the items with the corresponding time stamp, if there are
any. Slicing works as usual.
print(s.loc[pd.Timestamp('2018-3-16')], '\n')
#print(s.loc[pd.Timestamp('2018-3-16 10:00')], '\n') # KeyError
print(s.loc[pd.Timestamp('2018-3-16 00:00')], '\n')
print(s.loc[pd.Timestamp('2018-3-16'):pd.Timestamp('2018-3-20')])
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
Freq: D, dtype: int64
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
Freq: D, dtype: int64
231 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
Inexact Indexing
Passing strings containing partial dates/times selects time ranges. This technique is referred to as partial string indexing. Slicing is allowed.
print(s.loc['2018-3'], '\n')
print(s.loc['2018-3':'2018-4'])
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
...
2018-06-17 96
2018-06-18 97
2018-06-19 98
2018-06-20 99
2018-06-21 100
Freq: D, Length: 100, dtype: int64
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
2018-03-24 11
2018-03-25 12
2018-03-26 13
2018-03-27 14
2018-03-28 15
2018-03-29 16
2018-03-30 17
2018-03-31 18
Freq: D, dtype: int64
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
2018-03-24 11
2018-03-25 12
2018-03-26 13
2018-03-27 14
2018-03-28 15
2018-03-29 16
Inexact indexing has some pitfalls, which are described in Partial string indexing232 and Slice vs. exact match233 of
the Pandas user guide.
Pandas provides lots of functions for working with time stamp indices. Some are:
• asfreq234 (upsampling with fill values or filling logic)
• shift235 (shift index or data by some time period)
• resample236 (downsampling with aggregation, see below)
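Sketches of the first two methods (the data is my own choice):

```python
import pandas as pd

s = pd.Series(range(3), index=pd.date_range('2018-03-14', periods=3, freq='D'))

s2 = s.asfreq('12h')          # upsampling; new time stamps get NaN values
s3 = s.shift(1, freq='D')     # shift the index by one day, keep the data
```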
The resample method returns a Resampler237 object, which provides several methods for calculating data values
at the new time stamps. Examples are sum, mean, min, max. All these methods return a series or a data frame.
s2 = s.resample('5D').sum()
s2
232 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing
233 https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#slice-vs-exact-match
234 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.asfreq.html
235 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html
236 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html
237 https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html
2018-03-14 15
2018-03-19 40
2018-03-24 65
2018-03-29 90
2018-04-03 115
2018-04-08 140
2018-04-13 165
2018-04-18 190
2018-04-23 215
2018-04-28 240
2018-05-03 265
2018-05-08 290
2018-05-13 315
2018-05-18 340
2018-05-23 365
2018-05-28 390
2018-06-02 415
2018-06-07 440
2018-06-12 465
2018-06-17 490
Freq: 5D, dtype: int64
Period indices work analogously to time stamp indices. The corresponding class is PeriodIndex238 .
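The series s used in the following examples is not shown in the text; judging from the outputs, it was presumably constructed along these lines:

```python
import pandas as pd

# daily periods starting at 2018-03-14, values 1 to 10
s = pd.Series(range(1, 11),
              index=pd.period_range('2018-03-14', periods=10, freq='D'))
```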
2018-03-14 1
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
2018-03-19 6
2018-03-20 7
2018-03-21 8
2018-03-22 9
2018-03-23 10
Freq: D, dtype: int64
Indexing with time stamps selects the appropriate period, like with IntervalIndex objects:
s.loc[pd.Timestamp('2018-03-15 12:34')]
238 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.PeriodIndex.html
With a time stamp index the above line would lead to a KeyError. But for periods it's interpreted as: select the
period containing the time stamp.
Slicing behaves similarly:
2018-03-15 2
2018-03-16 3
2018-03-17 4
2018-03-18 5
Freq: D, dtype: int64
Next to numerical and string data one frequently encounters categorical data. That is data of arbitrary type with a
finite range of admissible values. The admissible values are called categories. There are two kinds of categorical data:
• nominal data (finitely many different values without any order)
• ordinal data (finitely many different values with a linear order)
Examples:
• colors red, blue, green, yellow (nominal)
• business days Monday, Tuesday, Wednesday, Thursday, Friday (ordinal)
Pandas provides explicit support for categorical data and indices. Major advantages of categorical data compared to
string data are lower memory consumption and more meaningful source code.
import pandas as pd
Pandas has a class Categorical to hold a list of categorical data with (ordinal) or without (nominal) ordering.
Such Categorical objects can directly be converted to series or columns of a data frame. Almost always category
labels are strings, but any other data type is allowed, too.
cat_data = pd.Categorical(['red', 'green', 'blue', 'green', 'green'],
                          categories=['red', 'green', 'blue'])  # reconstructed from the output below
s = pd.Series(cat_data)
s
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['red', 'green', 'blue']
Passing dtype='category' to series or data frame constructors works, too. Categories then are determined
automatically.
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['blue', 'green', 'red']
0 red
1 green
2 blue
3 green
4 green
dtype: category
Categories (3, object): ['blue', 'green', 'red']
quality = pd.Series(pd.Categorical(['good', 'poor', 'excellent'],  # example values (reconstruction)
                                   categories=['poor', 'good', 'very good', 'excellent'],
                                   ordered=True))
print(quality.min())
print(quality.max())
poor
excellent
Instead of using the general categorical data type we may define new categorical types. Strictly speaking, categorical
isn't a well-defined type, because we have to provide the category labels to obtain a full-fledged data type. A
more natural way of using categories is to define a data type for each set of categories via CategoricalDtype239 .
A further advantage is that the same set of categories can be used for several series and data frames simultaneously.
239 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.CategoricalDtype.html
0 red
1 red
2 NaN
3 blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']
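Presumably the output above resulted from a construction along the following lines (the concrete values are assumptions):

```python
import pandas as pd

colors = pd.CategoricalDtype(['red', 'green', 'blue', 'yellow'])

# values outside the categories become NaN ('black' here)
s = pd.Series(['red', 'red', 'black', 'blue'], dtype=colors)
```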
Most machine learning algorithms expect numerical input. Thus, categorical data has to be converted to numerical
data first.
For ordinal data one might use the numbers 1, 2, 3,… instead of the original category labels. But for nominal data the
natural ordering of integers adds artificial structure to the data, which might affect an algorithm's behavior. Thus, one-hot
encoding is usually used for converting nominal data to numerical data.
The idea is to replace a variable holding one of 𝑛 categories by 𝑛 boolean variables. Each new variable corresponds to
one category. Exactly one variable is set to True. Pandas supports this conversion via the get_dummies240 function.
df = pd.get_dummies(s)
df
0 red
1 red
2 green
3 blue
dtype: category
Categories (4, object): ['red', 'green', 'blue', 'yellow']
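The output of get_dummies itself is missing above; a self-contained sketch (values taken from the displayed series):

```python
import pandas as pd

s = pd.Series(pd.Categorical(['red', 'red', 'green', 'blue'],
                             categories=['red', 'green', 'blue', 'yellow']))

# one indicator column per category; exactly one entry per row is set
df = pd.get_dummies(s)
```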
Series or data frame columns with categorical data have a cat member providing access to the set of categories.
Some member functions are:
• rename_categories241 (modify category labels),
• add_categories242 (add category; at the highest position, if ordinal),
• remove_categories243 (remove category, replacing corresponding items by nan),
• union_categoricals244 (join sets of categories; a function in pandas.api.types rather than a cat method).
240 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
241 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.rename_categories.html
242 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.add_categories.html
243 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.remove_categories.html
244 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.union_categoricals.html
Information about categories cannot be stored in CSV files. Instead, category labels are written to the CSV file in
their native data type. When reading CSV data into a data frame, columns have to be converted to categorical types
again, if desired.
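A sketch of the round trip (column name and values are my own choice):

```python
import io
import pandas as pd

df = pd.DataFrame({'color': pd.Categorical(['red', 'green', 'red'])})

csv = df.to_csv(index=False)           # labels are written as plain strings

df2 = pd.read_csv(io.StringIO(csv))    # read back as ordinary string column
df2['color'] = df2['color'].astype('category')
```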
Pandas supports categorical indices via CategoricalIndex objects. Simply pass a Categorical object as
index when creating a series or a data frame.
quality = pd.Categorical(['poor', 'good', 'excellent', 'good', 'very good', 'poor'],
                         categories=['poor', 'good', 'very good', 'excellent'],
                         ordered=True)
s = pd.Series([3, 4, 2, 23, 41, 5], index=quality)
print(s, '\n')
s = s.sort_index()
s
poor 3
good 4
excellent 2
good 23
very good 41
poor 5
dtype: int64
poor 3
poor 5
good 4
good 23
very good 41
excellent 2
dtype: int64
print(s.loc['poor'], '\n')
print(s.loc['poor':'very good'])
poor 3
poor 5
dtype: int64
poor 3
poor 5
good 4
good 23
very good 41
dtype: int64
Continuous data, or discrete data with a too large range, can be converted to categories by providing a list of intervals
(bins) into which the items shall be placed. Each bin can be regarded as a category. Binning is important for machine
learning tasks which require discrete data. The pd.cut245 function implements binning.
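A minimal sketch of binning with pd.cut (bin edges and labels are my own choice):

```python
import pandas as pd

ages = pd.Series([2, 13, 25, 67, 41])

# three bins: (0, 18], (18, 65], (65, 120]
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=['child', 'adult', 'senior'])
```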
Restructuring and aggregation are two basic methods for extracting statistical information from data. We start with
groupwise aggregation and then discuss several forms of restructuring without and with additional aggregation.
import pandas as pd
18.6.1 Grouping
Grouping is the first step in the so-called split-apply-combine procedure in data processing. Data is split into groups
by some criterion, then some function is applied to each group, and finally the results get (re-)combined. Typical functions
in the apply step are sum or mean (more generally: aggregation) or any kind of transform or filter function (drop
groups containing NaN items, for instance).
This chapter follows the structure of the Pandas user guide246 , but leaves out sections on very specific details. Feel
free to have a look at those details later on.
Grouping is done by calling the groupby247 method of a series or data frame. It takes a column label or a list of
column labels as argument and returns a SeriesGroupBy or DataFrameGroupBy object. The returned object
represents a kind of list of groups, each group being a small series or data frame. All rows in a group have identical
values in the columns used for grouping.
The ...GroupBy object offers several methods for working with the determined groups. Iterating over such objects
is possible, too.
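The data frame df used throughout the following examples is not shown in the text; from the outputs below it can be reconstructed as follows (the answer column only appears in the later examples):

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2, 4, 5, 5, 5],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4, 7.2, 2.8, 3.9],
                   'answer': ['yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'no']})
```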
Grouping by one column and subsequent aggregation yields an index with values from the column used for grouping:
g = df.groupby('age')
df_means = g.mean()
df_means
age score
0 2 2.3
1 3 4.5
2 3 3.4
3 2 2.0
4 4 5.4
5 5 7.2
6 5 2.8
7 5 3.9
245 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
246 https://pandas.pydata.org/docs/user_guide/groupby.html
247 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
age: 2
age score
0 2 2.3
3 2 2.0
age: 3
age score
1 3 4.5
2 3 3.4
age: 4
age score
4 4 5.4
age: 5
age score
5 5 7.2
6 5 2.8
7 5 3.9
score
age
2 2.150000
3 3.950000
4 5.400000
5 4.633333
g = df.groupby(['age', 'answer'])
df_means = g.mean()
display(df_means)
age: 2
answer: no
age: 2
answer: yes
age: 3
answer: no
age: 4
answer: no
age: 5
answer: no
age: 5
answer: yes
score
age answer
2 no 2.00
yes 2.30
3 no 3.95
4 no 5.40
5 no 3.90
yes 5.00
Grouping by levels of a multi-level index is possible by providing the level argument to groupby.
With get_group we have access to single groups:
g.get_group((5, 'yes'))
g = df.groupby('age')
g['answer'].get_group(5)
5 yes
6 yes
7 no
Name: answer, dtype: object
Aggregation
To apply a function to each column of each group use aggregate248 . It takes a function or a list of functions as
argument. Providing a dictionary of column: function pairs allows for column specific functions.
import numpy as np
display(g.aggregate(np.min))
display(g.aggregate([np.min, np.max]))
display(g.aggregate({'answer': np.min, 'score': np.mean}))
answer score
age
2 no 2.0
3 no 3.4
4 no 5.4
5 no 2.8
answer score
amin amax amin amax
age
2 no yes 2.0 2.3
3 no no 3.4 4.5
4 no no 5.4 5.4
5 no yes 2.8 7.2
answer score
age
2 no 2.150000
3 no 3.950000
4 no 5.400000
5 no 4.633333
g.size()
age
2 2
3 2
4 1
5 3
dtype: int64
Many aggregation functions are directly accessible from the ...GroupBy object. Examples are ...GroupBy.
sum and ...GroupBy.mean. See Computations / descriptive stats249 for a complete list.
249 https://pandas.pydata.org/docs/reference/groupby.html#computations-descriptive-stats
Transformation
The transform250 method allows transforming rows groupwise, resulting in a data frame with the same shape as the
original one.
g = df.groupby('age')
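The transform call itself is missing from the text. A sketch with a centering transformation (function and data are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2],
                   'score': [2.3, 4.5, 3.4, 2.0]})

g = df.groupby('age')

# subtract each group's mean score; the result has the same shape as df[['score']]
centered = g[['score']].transform(lambda x: x - x.mean())
```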
Filtering
To remove groups use the filter251 method. It takes a function as argument and returns a data frame in which the
rows of all removed groups are dropped. The passed function gets the group (series or data frame) and has to return
True (keep group) or False (remove group).
g = df.groupby('age')
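The filter call is missing from the text. A sketch (criterion and data are my own choice), keeping only groups with at least two members:

```python
import pandas as pd

df = pd.DataFrame({'age': [2, 3, 3, 2, 4],
                   'score': [2.3, 4.5, 3.4, 2.0, 5.4]})

g = df.groupby('age')

# a group survives only if it contains at least two rows
kept = g.filter(lambda group: len(group) >= 2)
```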
There are three basic techniques for restructuring data in a data frame:
• pivot252 (interprets two specified columns as row and column index)
• stack253 /unstack254 (move (level of) column index to (level of) row index and vice versa)
• melt255 (create new column from some column labels)
Details and graphical illustrations of these techniques may be found in Pandas' user guide256 (first three sections).
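A minimal sketch of pivot and melt (column names and values are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'A', 'B', 'B'],
                   'year': [2020, 2021, 2020, 2021],
                   'population': [10, 11, 20, 22]})

# long to wide: rows indexed by city, columns by year
wide = df.pivot(index='city', columns='year', values='population')

# wide back to long: column labels become values of a new column
long = wide.reset_index().melt(id_vars='city', value_name='population')
```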
Pandas supports pivot tables via pivot_table257 function. Pivot tables are almost the same as pivoting with
pivot258 but allow for multiple values per data cell, which then are aggregated to one value.
Details may be found in Pandas’ user guide259 .
Similar functionality is provided by crosstab260 . See Pandas user guide261 , too.
Similar to the discussion in Efficiency Considerations (page 173) for NumPy, with Pandas we have to take care of how
we implement certain operations, at least if performance matters. NumPy guidelines carry over to Pandas, but some
additional remarks are in order.
import pandas as pd
252 https://pandas.pydata.org/docs/reference/api/pandas.pivot.html
253 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack
254 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.unstack
255 https://pandas.pydata.org/docs/reference/api/pandas.melt.html
256 https://pandas.pydata.org/docs/user_guide/reshaping.html
257 https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
258 https://pandas.pydata.org/docs/reference/api/pandas.pivot.html
259 https://pandas.pydata.org/docs/user_guide/reshaping.html#pivot-tables
260 https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html
261 https://pandas.pydata.org/docs/user_guide/reshaping.html#cross-tabulations
18.7.1 Vectorization
Analogously to NumPy, in Pandas we should avoid iterating over rows of series or data frames. Almost always
vectorization is possible. For numeric columns Pandas relies on NumPy’s vectorized function calls. For string and
date/time data Pandas implements tailor-made vectorization techniques.
Indices, series and data frame columns containing string data have a member str providing typical string operations.
Calling such a method applies the operation to each data item.
s = pd.Series(['abc', 'def', 'ghijklmn'])  # series reconstructed from the output below
s.str.upper()
0 ABC
1 DEF
2 GHIJKLMN
dtype: object
Indices, series and data frame columns containing timestamp data have a member dt providing typical date/time
operations. Calling such a method applies the operation to each data item.
# example dates: a Saturday, a Sunday, a Monday (reconstruction)
s = pd.Series(pd.to_datetime(['2021-05-01', '2021-05-02', '2021-05-03']))
s.dt.dayofweek
0 5
1 6
2 0
dtype: int64
Pandas has a function eval264 which executes Python-like code provided as a string. Due to optimization techniques
(CPU caching and others) eval is faster than standard Python code for long expressions involving large data frames.
The DataFrame.query265 method provides a simplified interface to eval for selecting rows via boolean
operations on columns.
Both methods should only be used for operations on very large data frames. For small data frames they are significantly
slower than standard Python. Have a look at Expression evaluation via eval()266 in Pandas' user guide for details.
262 https://pandas.pydata.org/docs/user_guide/text.html#method-summary
263 https://pandas.pydata.org/docs/user_guide/basics.html#dt-accessor
264 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html
265 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
266 https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#expression-evaluation-via-eval
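A sketch of both interfaces (column names and data are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5, 10)})

# evaluate an arithmetic expression given as a string
c = pd.eval('df.a + 2 * df.b')

# select rows via a boolean expression on columns
rows = df.query('a > 1 and b < 9')
```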
Sometimes data sets are too large to be loaded into memory as a whole. Pandas supports partial loading, and there
are other Pandas-like Python libraries supporting data sets larger than memory.
Partial Loading
The pd.read_csv267 function supports chunking, that is, loading data in chunks. After processing a chunk it gets
removed from memory and the next chunk can be read into memory. See Iterating through files chunk by chunk268 in
Pandas' user guide.
Other Libraries
Dask269 is a parallel computing library with a Pandas-like API. It allows for faster processing of large data sets. Have
a look at Use other libraries270 in Pandas' user guide for a quick introduction.
267 https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
268 https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#iterating-through-files-chunk-by-chunk
269 https://www.dask.org/
270 https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#use-other-libraries
Exercises
19 Computer Basics
To solve these exercises on bits and bytes and representations of numbers you should have read Computers and
Programming (page 33).
A hard disk is advertised to have a capacity of 2 TB. Give the hard disk’s capacity in TiB.
Solution
# your solution
An image has a width of 4000 pixels and a height of 3000 pixels. For each pixel three color components (red, green,
blue) have to be saved, where 8 bits per component are required. How much space does the image need when saving
it to a file without compression?
Solution
# your solution
19.1.3 Books
A book with 500 pages and 1500 characters per page on average shall be saved to a file. Each character requires one
byte of disk space. What is the file's total size?
A well-known online encyclopedia has about 50 GB of English text (without images). How many books are needed
for a print version?
How long does it take to transfer 50 GB of data with a transfer rate of 100 megabit per second?
Solution
# your solution
Data Science and Artificial Intelligence for Undergraduates
How much storage is required for an uncompressed 120-minute video file with 30 frames (that is, images) per second
and full HD resolution (1920x1080 pixels, 24 bits per pixel)?
What's the compression rate if the video file has a size of 4.1 GB?
Solution
# your solution
Write the 32-digit binary number 10001001 11111110 00001010 01001101 as a hexadecimal number.
Solution
# your solution
Give two differences between a computer’s memory and a mass storage device.
Solution
# your solution
20 Python Programming
Computer programs contain lots of errors. Finding them is much more difficult than correcting them. In this series
of exercises you have to find syntax and semantic errors in small programs. Before you start you should have read
Building Blocks (page 46).
Hint: Simply run the programs and let the Python interpreter look for errors. Then correct all errors identified by
the interpreter. If still something is not working as expected, have a closer look at the source code.
20.1.1 Simple 1
Solution:
# your modifications
243
Data Science and Artificial Intelligence for Undergraduates
20.1.2 Simple 2
a = 2
b = 3
if a < b
print(a, '<', b)
Solution:
# your modifications
a = 2
b = 3
if a < b
print(a, '<', b)
20.1.3 Simple 3
def say_something(text):
say_samething('Python is great!')
Solution:
# your modifications
def say_something(text):
say_samething('Python is great!')
20.1.4 Simple 4
a = 2
b = 3
if a = b:
print(a, 'equals', b)
else:
print('not equal')
Solution:
# your modifications
a = 2
b = 3
Solution:
# your modifications
Solution:
# your modifications
my_list = [1, 3, 4, 2, 6]
Solution:
# your modifications
my_list = [1, 3, 4, 2, 6]
last = 5
Solution:
# your modifications
last = 5
Solution:
# your modifications
l = [2, 3, 2, 5, 7, 8, 9]
Solution:
# your modifications
l = [2, 3, 2, 5, 7, 8, 9]
l = [2, 4, 6, 3]
Solution:
# your modifications
l = [2, 4, 6, 3]
20.1.12 Difficult 1
total_sum = 0
for i in range(0, len(my_lists)):
for j in range(0, len(my_lists[0])):
total_sum = total_sum + my_lists[j][i]
print(total_sum)
Solution:
# your modifications
total_sum = 0
for i in range(0, len(my_lists)):
for j in range(0, len(my_lists[0])):
total_sum = total_sum + my_lists[j][i]
print(total_sum)
20.1.13 Difficult 2
last = 19
Solution:
# your modifications
last = 19
20.1.14 Difficult 3
def print_first_half(l):
'''print first half of list, include center item if length of list is odd'''
def print_second_half(l):
'''print second half of list, omit center item if length of list is odd'''
l = [2, 4, 3, 6, 8, 5, 7]
print_first_half(l)
print_second_half(l)
l = [2, 4, 3, 6, 8, 5]
print_first_half(l)
print_second_half(l)
l = [1]
print_first_half(l)
print_second_half(l)
l = []
print_first_half(l)
print_second_half(l)
Solution:
# your modifications
def print_first_half(l):
'''print first half of list, include center item if length of list is odd'''
def print_second_half(l):
'''print second half of list, omit center item if length of list is odd'''
l = [2, 4, 3, 6, 8, 5, 7]
print_first_half(l)
print_second_half(l)
l = [2, 4, 3, 6, 8, 5]
print_first_half(l)
print_second_half(l)
l = [1]
print_first_half(l)
print_second_half(l)
l = []
print_first_half(l)
print_second_half(l)
20.2 Basics
Before solving these programming exercises you should have read Building Blocks (page 46). Only use Python features
discussed there.
If light travels 300 million meters per second, how many kilometers does light travel in one hour?
Solution:
# your solution
A 20-year-old person started watching web videos at the age of 8. Every day the person watches videos for 2 hours.
If the person had watched videos for the same total number of hours, but all day long instead of only 2 hours per day,
how many years of his or her young life would the person have wasted?
Take into account that the person needs approximately 7 hours for sleeping and 3 hours for eating, grooming, and doing
housework. Thus, the time available for watching videos is less than 24 hours a day.
Assume that each year has 365 days.
Solution:
# your solution
Get two integers from the user. Divide the first by the second. Use floor division and show the remainder, too. Avoid
ZeroDivisionError.
Solution:
# your solution
Get three integers from the user and print a message if they are not in ascending order.
Solution:
# your solution
Get three integers from the user and print them in ascending order.
Solution:
# your solution
20.2.6 Functions 1
Write a function is_ascending which checks whether its three numeric arguments are in ascending order. Return
a boolean value.
Test the function with an ascending sequence and a non-ascending sequence.
Solution:
# your solution
20.2.7 Functions 2
Write a function cut_off_decimals which takes two arguments, a float and a positive integer. The function shall
cut off all but the specified number of decimal places of the float and then return the float. If the second argument is
negative, the float shall remain untouched.
Note that floor division also works for floats. Thus, x // 1 cuts off all decimal places.
Test the function with 1.2345 and 2 decimal places.
Solution:
# your solution
20.2.8 Loops 1
Print the numbers 0 to 9 with a while loop. In other words: Use a while loop to simulate a for loop.
Solution:
# your solution
20.2.9 Loops 2
Take a list and print its items in reverse order. Test your code with [-1, 2, -3, 4, -5].
Solution:
# your solution
20.2.10 Loops 3
Take a list and print it item by item. After each item ask the user whether he wants to see the next item. If not, stop
printing. If yes, go on.
Solution:
# your solution
20.2.11 Loops 4
Take a list and print it item by item. After each item ask the user whether he wants to see the next item. If not, stop
printing. If yes, go on. If there are no more items left, start again with the first item.
Solution:
# your solution
Solving this set of exercises increases your skills in algorithmic thinking and Python’s syntax. Everything you need
has been discussed in the Crash Course (page 43) chapter. Do not use additional Python features or modules.
Get two integers from the user and check whether the corresponding point lies inside the rectangle with corners at
(-1, -1), (5, -1), (5, 2), (-1, 2). Print a message showing the result.
Solution:
# your solution
Get an integer from the user and tell the user whether it’s a square number or not. If you want to compute square
roots, use 123 ** 0.5. Print a message if the user gave a negative number.
Hint: Have a look at the output of 16.125 % 1.
Solution:
# your solution
Write a function no_duplicates which returns True if the passed list contains no duplicates and False if
there are duplicates.
Test your function with [1, 4, 5, 6, 3] and [1, 3, 1] (and with [], of course).
Solution:
# your solution
Write a function inc_subseq which takes a list of numbers and prints all items except the ones which are smaller
than their predecessor.
Test your function (at least) with [1, 3, 2, 3, 4, -2, 9] and [3, 2, 1].
Solution:
# your solution
Get an integer radius of a circle from the user. Calculate and print the circle’s area as well as the edge length of a
square with identical area. Check user input for validity. You may use NumPy’s pi constant.
Solution:
# your solution
Solve the quadratic equation a·x² + b·x + c = 0 with user-specified a, b, c (integers). Give all real solutions.
Solution:
# your solution
Use Matplotlib and NumPy’s sin and cos functions to plot a regular polygon. Ask the user for the number of
vertices and check user input for validity.
Hint: For n vertices the kth vertex is at (cos φ, sin φ) with φ = 2πk/n.
Solution:
# your solution
20.3.8 Stars
Draw a star with a user-specified number of outer vertices. The radius for inner vertices is 0.3, for outer vertices it's 1.
Solution:
# your solution
Python’s approach to variables and operators is simple and beautiful, although beginners need some time to see both
simplicity and beauty. In each exercise below keep track of what’s happening in detail behind the scenes. That is,
track the creation of objects and which name is tied to which object. Before solving the exercises read the chapter on
Variables and Operators (page 75) (sections Operators as Member Functions (page 88) and Efficiency (page 89) are
not required here).
The following code contains an error. Find it (by running the code) and correct it.
def do_something():
n = n + 1
print('something')
Solution:
# your modifications
def do_something():
n = n + 1
print('something')
The following code shall print 10 lines each containing 5 numbers. Make it work correctly.
Solution:
# your modifications
a = [1, 2, 3, 4, 5]
squares = a
for i in range(0, len(squares)):
squares[i] **= 2
Solution:
# your modifications
a = [1, 2, 3, 4, 5]
squares = a
for i in range(0, len(squares)):
squares[i] **= 2
Why do the following two code cells yield different results? Explain in detail what’s happening!
a = 2
b = a
a = 5
print(b)
a = [2]
b = a
a[0] = 5
print(b[0])
Solution:
# your answer
20.4.5 2 > 3?
Guess why the condition 2 <= 3 == True evaluates to False. How can it be repaired?
if 2 <= 3 == True:
print('2 <= 3')
else:
print('2 > 3, really?')
Solution:
# your modifications
if 2 <= 3 == True:
print('2 <= 3')
else:
print('2 > 3, really?')
2 > 3, really?
a = 5
b = 2
c = 3
print(a - b - c)
a = 5
b = 2
c = 3
a -= b - c
print(a)
Solution:
# your answer
Here you find some exercises on Python’s optimization techniques for memory management discussed in Efficiency
(page 89).
The last two tasks demonstrate effects of running out of memory and how to prevent such situations. Read Garbage
Collection (page 91) first.
a = 100
b = 2 * a
c = 2 * a
print(b is c)
True
a = 100
b = 3 * a
c = 3 * a
print(b is c)
False
Solution:
# your answer
Copy the following code to a text file and feed the file to the Python interpreter. Why do both variants yield different
results? Why does running the code in Jupyter yield identical results in both cases?
# variant 1
a = 1234
b = 1234
print(a is b)
# variant 2
c = 34
a = 1200 + c
b = 1200 + c
print(a is b)
False
False
Solution:
# your answer
Copy the following code to a text file and feed the file to the Python interpreter. Do you have an idea why the next
code yields True?
a = 1200 + 34
b = 1200 + 34
print(a is b)
False
Solution:
# your answer
Run the following code and observe what happens to your system. Use some system tool to monitor memory con-
sumption. You may interrupt the Python kernel in Jupyter if necessary.
Warning: Save all data you are currently editing. Depending on your system you may have to reboot your
machine due to the system hanging.
stop = False
data = 'x'
while not stop:  # the while loop and the input line are reconstructed
    fac = input('factor (empty string to stop): ')
    if fac == '':
        stop = True
    else:
        fac = int(fac)
        print('increasing memory usage to', fac * len(data), 'bytes...')
        new_data = ''
        for i in range(0, fac):
            new_data += data
        data = new_data
        del new_data
        print('...done')
del data
Note the del data line. This line is not required if the program is run as a stand-alone program (text file fed to the
interpreter), because at exit the interpreter will free all memory used by the program. But in Jupyter the interpreter
does not stop at the end of a code cell. All objects created by the cell remain in memory until the kernel dies or
memory is explicitly freed via del.
Solution:
# your answer
Write a function which returns a string with approximately 1 billion characters. Then write a program which can
manage three data sets. Repeatedly ask the user whether he or she wants to exit the program or whether he or she
wants to load/remove data set 1/2/3. Load and remove data sets according to the user’s choice. Monitor memory
consumption while running the program to see whether the program works as intended.
Solution:
# your solution
To solve these exercises you should have read Lists and Friends (page 93). Only use features discussed there or in
previous chapters.
Consider the following code snippet. What numbers appear on screen when printing a, b, e, g, and h after executing
the code? Don't run the code, interpret each line manually.
a = 1
b = 2
c = [a, b]
c[0] = 3
d = c
d[1] = 4
c.append(5)
e = c[-1]
f = c[0:-1]
g = f[-1]
h = d[0]
Solution:
# your answer
The following code yields incorrect outputs. Find the problem and solve it.
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
d = [a, b, c]
# compute squares
for i in range(0, 3):
for j in range(0, 3):
d[i][j] **= 2
# print results
print('squares of d:', d)
print('sum of a:', sum_a)
print('sum of b:', sum_b)
print('sum of c:', sum_c)
Solution:
# your modifications
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
d = [a, b, c]
# compute squares
for i in range(0, 3):
for j in range(0, 3):
d[i][j] **= 2
# print results
print('squares of d:', d)
print('sum of a:', sum_a)
print('sum of b:', sum_b)
print('sum of c:', sum_c)
Write a function shift which takes a list, increases each item's index by one (last item becomes first one), and
returns the resulting list. Example: [1, 3, 5, 7] should become [7, 1, 3, 5]. Don't use loops, but slicing
syntax.
Solution:
# your solution
Write a function shift_n which takes a list and a positive integer n and shifts the list n times (cf. previous task).
Don’t use loops. Do not forget to think about the case that n is larger than the length of the list. Examples:
• shift_n([1, 2, 3, 4, 5], 3) should be [3, 4, 5, 1, 2].
• shift_n([1, 2, 3, 4, 5], 5) should be [1, 2, 3, 4, 5].
• shift_n([1, 2, 3, 4, 5], 6) should be [5, 1, 2, 3, 4].
Solution:
# your solution
Given two lists keys and values, create a dictionary. Use a for loop to fill an empty dictionary item by item. Test
case:
Solution:
# your solution
Given two lists keys and values, create a dictionary. Use a dictionary comprehension. Test case:
Solution:
# your solution
Given two lists keys and values, create a dictionary. Call dict and pass a list of key-value pairs. Test case:
Solution:
# your solution
20.6.8 One-Liner 1
Given a list of numbers, write one line of code to create a new list containing only numbers greater than 3. Test case:
[4, 3, 2, 8, 6, 0, 4, 6, -2, 1]
Solution:
# your solution
20.6.9 One-Liner 2
In one line of code create a new list containing every second number from a given list, if the number is between 1
and 10 (both included). Test case:
Solution:
# your solution
20.6.10 One-Liner 3
Write one line of code to square all numbers in a list of lists of numbers. Test case:
Solution:
# your solution
Make a copy of a list of lists of numbers. In the end, changing a number in the copy must not modify the original
numbers. Test case:
Solution:
# your solution
20.7 Strings
Read Strings (page 105) before you start with the exercises.
Print a list of all characters appearing in a string. Count how often each character appears without using
str.count. Get the string from user input.
Solution:
# your solution
Print the string I like apples and melons., where the words apples and melons shall be replaced by
suitable Unicode symbols.
Solution:
# your solution
20.7.3 Parser
Write a function which converts a string representation of a table of integers to a list of lists of integers. The elements
of a row are separated by commas and rows are separated by semicolons. Test your function with
'1,2,3;4,5,6;7,8,9'
Result should be
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Solution:
# your solution
Write a Python script which prompts the user for some input and translates the input to spoon language271.
Hint: Simple replacement does not work. If, for instance, a is first replaced by alewa, subsequent replacement of e
will alter the lew in alewa. There seems to be no replacement order for vowels that works correctly in every case.
Thus, in a first pass replace all vowels by some code (unlikely to appear in a text); then, in a second pass, replace the
codes by the vowel's lew-version.
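A minimal sketch of the two-pass idea from the hint, for lower-case vowels only (the placeholder format is an arbitrary choice, and this is not the full program with user input):

```python
VOWELS = 'aeiou'

def to_spoon(text):
    # pass 1: replace each vowel by a placeholder unlikely to occur in text
    for i, v in enumerate(VOWELS):
        text = text.replace(v, f'\0{i}\0')
    # pass 2: replace each placeholder by the vowel's lew-version
    for i, v in enumerate(VOWELS):
        text = text.replace(f'\0{i}\0', f'{v}lew{v}')
    return text

print(to_spoon('banana'))   # balewanalewanalewa
```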
Solution:
# your solution
271 https://de.wikipedia.org/wiki/Spielsprache#L%C3%B6ffelsprache
Read Accessing Data (page 111) before you start with the exercises.
Read some text file’s content, convert it to lower case and save it to a new text file.
Solution:
# your solution
Get a CSV file containing all public trees in Chemnitz from the open data portal of Chemnitz272. Read the first 10
lines from the file and show them on screen.
Hint: If you encounter cumbersome symbols in the output, have a look at byte order marks273 at Wikipedia.
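A sketch of the BOM effect mentioned in the hint (the file name is arbitrary): Python's 'utf-8-sig' codec strips a leading byte order mark on reading, while plain 'utf-8' keeps it as part of the text.

```python
# write a small file including a BOM, as many Windows tools do
with open('bom_demo.csv', 'w', encoding='utf-8-sig') as f:
    f.write('id;name\n1;oak\n')

with open('bom_demo.csv', encoding='utf-8') as f:
    print(repr(f.readline()))      # first character is '\ufeff' (the BOM)

with open('bom_demo.csv', encoding='utf-8-sig') as f:
    print(repr(f.readline()))      # BOM stripped transparently
```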
Solution:
# your solution
Get 'The Blog Authorship Corpus' from the web. The original source https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
vanished in 2022. Use https://www.fh-zwickau.de/~jef19jdw/datasets/blogs.zip instead. The ZIP file contains an
info.txt file and the original ZIP file.
Read info.txt to get information about the file name format used in the ZIP file.
Write a Python program which extracts all 5 features from the file names and saves them to a CSV file.
Solution:
# your solution
Open the file 7596.male.26.Internet.Scorpio.xml from ‘The Blog Authorship Corpus’ (see exercise
above) without extracting it explicitly. Print the first and the last post in the file to screen.
Solution:
# your solution
272 http://portal-chemnitz.opendata.arcgis.com
273 https://en.wikipedia.org/wiki/Byte_order_mark
20.9 Functions
Read Functions (page 127) before you start with the exercises.
Write a function which squares all numbers in a list. The list of numbers is the only parameter and there is no return
value. The results have to be stored in the original list.
Test your code with
[1, 2, 3, 4, 5]
Solution:
# your solution
Write a function taking an arbitrary number of keyword arguments and printing all of them.
Test your code by passing
a: 1
b: test
c: (1, 2, 3)
Solution:
# your solution
# your solution
20.9.4 Composition
# your solution
20.9.5 Sorting
Sort a list of paths by file name. Use list.sort with a custom sort key.
Test your code with
['/some/path/file.txt',
'/another/path/xfile.txt',
'file_without_path.xyz',
'../relative/path/abc.py',
'/no/extension/some_file.txt']
Result should be
['../relative/path/abc.py',
'/some/path/file.txt',
'file_without_path.xyz',
'/no/extension/some_file.txt',
'/another/path/xfile.txt']
Solution:
# your solution
Given an integer 𝑛 calculate 𝑛! (see Combinatorics (page 331) for a definition) using a loop. Then calculate 𝑛! by
exploiting the recursive rule 𝑛! = 𝑛 ⋅ (𝑛 − 1)!. Test both functions with 10! = 3628800.
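Both variants can be sketched as follows; treat this as one possible structure, not the reference solution:

```python
def factorial_loop(n):
    # iterative: multiply 1 * 2 * ... * n
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

def factorial_recursive(n):
    # recursive: n! = n * (n-1)!, with 0! = 1! = 1
    return 1 if n <= 1 else n * factorial_recursive(n - 1)

print(factorial_loop(10), factorial_recursive(10))   # 3628800 3628800
```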
Solution:
# your solution
Read Inheritance (page 143) before you start with the exercises.
20.10.1 Animals
Write a class Animal whose constructor takes the number of legs and the weight as arguments. Implement a method
say_something (an abstract animal says nothing) and a method show_up which prints the animal as ASCII
art274:
(########):
//||||\\
Bonus: Every time an Animal object is created, a message 'An animal with … legs is hiding somewhere.'
shall be printed.
Test your code by creating and showing animals with 0, 1,…, 8 legs.
Solution:
# your solution
20.10.2 Dogs
Derive a class Dog from the Animal class (cf. exercise above). Implement a proper say_something method
(‘Wau’, for instance) and reimplement show_up to show a dog instead of an abstract animal.
Note that the constructor only takes the dog's weight as argument. The constructor should call the base class'
constructor to set the correct number of legs (and show the bonus message, if implemented).
In your test code also check whether the dog is lightweight.
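The mechanism of calling a base class constructor can be sketched as follows (the class and attribute names here are generic placeholders, not the exercise's Animal and Dog classes):

```python
class Base:
    def __init__(self, legs):
        self.legs = legs

class Derived(Base):
    def __init__(self, weight):
        # delegate to the base class constructor to set inherited attributes
        super().__init__(legs=4)
        self.weight = weight

d = Derived(weight=12.5)
print(d.legs, d.weight)   # 4 12.5
```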
Solution:
# your solution
274 https://en.wikipedia.org/wiki/ASCII_art
Add methods sit_down and stand_up to your Dog class. Depending on which of the two methods was called
last, show_up shall show a sitting or a standing dog.
Solution:
# your solution
20.10.4 Fish
Derive a Fish class from Animal. Add a member function to_sticks. After calling this function once,
show_up should show 10 fish sticks per kilogram of the fish instead of a live fish.
Don't forget to implement say_something (maybe 'Blub' or, more realistically, '').
Solution:
# your solution
21 Managing Data
Before solving these basic NumPy exercises you should have read NumPy Arrays (page 155), Array Operations
(page 161), Advanced Indexing (page 165), and Vectorization (page 166). Only use NumPy features discussed there.
import numpy as np
21.1.1 Maximum
Given two arrays print the largest value of all elements from both arrays. Use np.max or np.maximum or both.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and
[1, 2, 3]
Solution:
# your solution
Print the largest elementwise absolute difference between two equally shaped arrays.
Test your code with
[[1.3, 2.4],
 [3.5, 6.5]]
and
[[1.25, 2.34],
 [3.499, 6.55]]
Solution:
# your solution
Data Science and Artificial Intelligence for Undergraduates
21.1.3 Comparisons 1
Given an array of integers, test whether there is a one in the array. Print a message showing the result.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and also with
[[0, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
Solution:
# your solution
21.1.4 Comparisons 2
Test whether all elements of an array are positive. Print a message showing the result.
Test your code with
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
and also with
[[1, -2, 3],
 [-4, 5, 6],
 [7, 8, 9]]
Solution:
# your solution
21.1.5 Indexing 1
Set all elements between -1 and 1 to zero. Remember: Avoid loops wherever possible.
Test your code with
[[-2, -0.5, 0],
 [0.4, 5, 0.9],
 [1, 8, 9]]
Solution:
# your solution
21.1.6 Indexing 2
Write a function which takes an integer 𝑛 and returns an 𝑛 × 𝑛 matrix with ones at the borders and zeros inside. Data
type of the matrix should be int8.
Example for 𝑛 = 5:
[[1, 1, 1, 1, 1],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [1, 1, 1, 1, 1]]
Solution:
# your solution
21.1.7 Indexing 3
Write a function which takes an arbitrarily sized matrix and returns a matrix having the same elements as the original
matrix, but being bordered by additional zeros. The returned matrix should have the same data type as the original
matrix.
Example:
input:
[[1, 2, 3],
 [4, 5, 6]]
output:
[[0, 0, 0, 0, 0],
 [0, 1, 2, 3, 0],
 [0, 4, 5, 6, 0],
 [0, 0, 0, 0, 0]]
Solution:
# your solution
21.1.8 Broadcasting
Take the first row of a matrix and add it to all other rows of the matrix. Print the resulting matrix.
Test your code with
[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 10, 11, 12]]
Solution:
# your solution
Given a square matrix with an odd number of rows/columns, replace all but the boundary elements by zeros and write
the mean of all boundary elements to the center element. Print the modified matrix.
Example:
original matrix:
[[-1, 2, -1, 2, -1],
 [2, -2, 3, -2, 2],
 [-1, 3, -3, 3, -1],
 [2, -2, 3, -2, 2],
 [-1, 2, -1, 2, -1]]
result:
[[-1, 2, -1, 2, -1],
 [2, 0, 0, 0, 2],
 [-1, 0, 0.5, 0, -1],
 [2, 0, 0, 0, 2],
 [-1, 2, -1, 2, -1]]
Solution:
# your solution
Images can be represented by NumPy arrays with two (grayscale) or three (color) dimensions. Thus, basic image
processing like color transforms and cropping reduces to operations on NumPy arrays. Before solving the exercises
you should have read Efficient Computations with NumPy (page 155). Only use NumPy features, no additional
modules.
To show results (images) on screen use Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
def show_image(img):
    ''' Show 2d or 3d NumPy array img as image (color range 0...1). '''
    # check range
    if img.min() < 0 or img.max() > 1:
        print('Color values out of range!')
        return
    # check dims
    if img.ndim == 2:  # disable color normalization for grayscale
        plt.imshow(img, vmin=0, vmax=1, cmap='gray')
    elif img.ndim == 3:
        plt.imshow(img)
    else:
        print('Wrong number of dimensions!')
        return
    plt.show()
Hint: Start with the first exercise and complete exercises one by one. Each exercise will use results from the previous
one.
Load all images (NumPy arrays) contained in pasta.npz. The file contains three color images. Name the arrays
img1, img2, img3.
Print the images’ shapes and show the images with show_image. What’s the numeric range of the pixel values
(floats 0…1 or integers 0…255)?
Solution:
# your solution
Write a function rgb2gray which takes a color image (3d array) and returns a grayscale image (2d array). Calculate
gray levels as pixelwise mean of red, green and blue values.
Apply the function to the three images and show results.
Hint: NumPy has a mean function. The axis parameter could be of interest.
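A sketch of how the axis parameter collapses the color dimension (the tiny array is made up):

```python
import numpy as np

# shape (1, 2, 3): one row, two pixels, three color channels
img = np.array([[[0.0, 0.5, 1.0],
                 [0.3, 0.3, 0.3]]])
# averaging over the last axis turns h x w x 3 into h x w
gray = img.mean(axis=-1)
print(gray.shape)   # (1, 2)
```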
Solution:
# your solution
21.2.3 Cropping
Each image shows a piece of alphabet pasta on a perfect black background (color value 0). Write a function
auto_crop which finds the bounding box of a grayscale image and returns a copy (not a view!) of the
corresponding subarray. The bounding box is the rectangle that contains the pasta piece without black margin.
Apply the function to the grayscale pasta images and show the results.
Solution:
# your solution
21.2.4 Centering
Write a function center which takes a grayscale image and an integer n. The return value shall be a grayscale
image of size n x n with black background and the passed image positioned at the new image's center (identical
margin width on all sides).
Place each cropped pasta image in a 50x50 image and show the results.
Solution:
# your solution
Implement the above auto_crop function without using loops, that is, fully vectorized. Compare execution times.
Solution:
# your solution
Before solving these basic Pandas exercises you should have read Series (page 188) and Data Frames (page 198).
For these exercises we use a dataset describing used cars obtained from kaggle.com275. Licenses: Open Data
Commons Database Contents License (DbCL) v1.0276 and Open Data Commons Open Database License (ODbL)277.
import pandas as pd
data = pd.read_csv('cars.csv')
275 https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
276 http://opendatacommons.org/licenses/dbcl/1.0/
277 https://opendatacommons.org/licenses/odbl/summary/
Basic Information
# your solution
Missing Values
# your answer
Value Counts
# your solution
# your solution
New Columns
Append a column 'manual_trans' containing True where column 'transmission' shows 'Manual',
else False.
Append a column 'age' showing a car’s age (now minus 'year').
Solution:
# your solution
278 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html
279 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
Remove Columns
# your solution
Create a Pandas series price with column 'name' as index and column 'selling_price' as data.
Solution:
# your solution
Mean
# your solution
Boolean Indexing
Use boolean row indexing to get a data frame one_model with columns 'km_driven' and 'age' containing
only rows with 'name' equal to 'Maruti Swift Dzire VDI'.
Solution:
# your solution
New Column
Add a column 'km_per_year' to the one_model data frame containing kilometers per year.
Solution:
# your solution
Mean
# your solution
Find the oldest car in data and print its name and manufacturing year. Have a look at Pandas’ documentation280 for
suitable functions.
Solution:
# your solution
Before solving these exercises you should have read Advanced Indexing (page 207) and Dates and Times (page 217).
import pandas as pd
21.4.1 Cars
For these exercises we use a dataset describing used cars obtained from kaggle.com281. Licenses: Open Data
Commons Database Contents License (DbCL) v1.0282 and Open Data Commons Open Database License (ODbL)283.
data = pd.read_csv('cars.csv')
Create a multi-level index for the data frame from columns 'name' and 'year'.
Solution:
# your solution
280 https://pandas.pydata.org/docs/
281 https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
282 http://opendatacommons.org/licenses/dbcl/1.0/
283 https://opendatacommons.org/licenses/odbl/summary/
Select Model
Print all rows for the 'Maruti Swift Dzire VDI' 2018 model.
Solution:
# your solution
Diesel
Select all 2018 cars and use value_counts284 to get the percentage of Diesel cars.
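A sketch of the value_counts behavior assumed here, on made-up data:

```python
import pandas as pd

fuel = pd.Series(['Diesel', 'Petrol', 'Diesel', 'Diesel'])
# normalize=True returns relative frequencies instead of raw counts
print(fuel.value_counts(normalize=True)['Diesel'])   # 0.75
```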
Solution:
# your solution
Old Cars
Print all cars with more than 100000 kilometers driven and manufactured before 2000.
Solution:
# your solution
21.4.2 E-Mails
Consider an email account receiving emails every day. Use the following code to generate a list times of time
stamps representing arrival times of emails.
import numpy as np
rng = np.random.default_rng(0)
n_mails = 1000
start_time = pd.Timestamp('2019-01-01 00:00:00')
end_time = pd.Timestamp('2020-01-01 00:00:00')
Given the list of time stamps of incoming mails create a series with daily mail counts.
Solution:
# your solution
284 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html
Every day the user answers only those mails received up to 7:00am that day. From the list of time stamps create a
series with daily mail counts at 7:00am. Hint: Have a look at the offset argument of Series.resample285;
label might be of interest, too.
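A sketch of the resample mechanics on a tiny hand-made series (parameter names as in pandas ≥ 1.1; the time stamps are made up):

```python
import pandas as pd

times = pd.to_datetime(['2019-01-01 06:00', '2019-01-01 08:00',
                        '2019-01-02 06:59'])
mails = pd.Series(1, index=times)
# daily bins shifted to end at 7:00; label='right' stamps each bin
# with its end time
counts = mails.resample('D', offset='7h', label='right').sum()
print(counts)
```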
Solution:
# your solution
Assume the user reads and answers emails on business days only (again, at 7:00am). Create a series containing the
numbers of mails to process on each business day.
Solution:
# your solution
Vacation
From the results of the previous task get the number of mails arriving during winter vacation in January and February.
Use a variable for the year of interest:
year = 2019
Write code which works for all years (leap year or not).
Solution:
# your solution
Before solving these exercises you should have read Advanced Indexing (page 207), Dates and Times (page 217),
Categorical Data (page 223), and Restructuring Data (page 227).
import pandas as pd
import numpy as np
21.5.1 Grades
Use the following code to create a series containing student IDs as index and points in exam as data:
rng = np.random.default_rng(123)
n_students = 20
max_points = 40
exam_points
What do the two points = ... lines in the above code do in detail?
Solution:
# your answer
Points to Grades
Add a column to the series (resulting in a data frame) containing corresponding grades. Conversion from points to
grade is as follows:
points grade
id
20077 24 3.3
23411 11 5.0
22964 33 2.3
...
# your solution
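The pandas.cut function cited in the footnote maps numeric values to labeled bins; a minimal sketch (the bin edges and grade labels here are made up, not the exercise's grading scheme):

```python
import pandas as pd

points = pd.Series([24, 11, 33])
# bins are half-open intervals (0, 15], (15, 25], (25, 40]
grades = pd.cut(points, bins=[0, 15, 25, 40], labels=['5.0', '3.0', '1.0'])
print(list(grades))   # ['3.0', '5.0', '1.0']
```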
286 https://pandas.pydata.org/docs/reference/api/pandas.cut.html
Mean Grade
Get the mean grade for all students who passed the exam (grade better than 5).
Solution:
# your solution
21.5.2 Cafeteria
For these exercises we use the dataset obtained in the Cafeteria (page 315) project.
data
Dates
Convert 'date' column to Timestamp. Hint: the pd.to_datetime287 function is very flexible.
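A minimal sketch of pd.to_datetime on a column of date strings (the sample dates are made up):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2021-03-01', '2021-03-02']})
# ISO-formatted dates are parsed without extra arguments
df['date'] = pd.to_datetime(df['date'])
print(df['date'].dtype)   # datetime64[ns]
```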
Solution:
# your solution
Categories
# your solution
Get mean students/staff/guests prices per category. Sort results by students price.
Solution:
# your solution
Drop rows with nan or 0.0 prices. Then get minimum, average, and maximum students prices per day. Create a data
frame with three columns 'min', 'mean', 'max' and a DatetimeIndex. Call the data frame's plot288 method
(works without arguments) to visualize the results.
Solution:
# your solution
287 https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
288 https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
Restructuring
Create a data frame showing prices only, no meal names. Use dates for row index. Column index shall be multi-level
with first level showing the category and second level showing the price level (students/staff/guests).
Solution:
# your solution
Before solving these exercises you should have read High-Level Data Management with Pandas (page 187).
import pandas as pd
For these exercises we use the dataset obtained in the Cafeteria (page 315) project.
data
21.6.1 Preprocessing
Data Types
Convert 'date' and 'category' columns to Timestamp and category, respectively (see Advanced Pandas
(page 278) exercises).
Solution:
# your solution
Count Categories
# your solution
Remove Categories
Remove categories which do not contain full-fledged meal descriptions (e.g., 'Salat Bar').
Solution:
# your solution
Most meals contain information on allergens and additives (numbers in parentheses). Remove this information to
get more readable meal descriptions. Implement the removal procedure twice: without and with vectorized string
operations. Measure and compare execution times.
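A sketch of a vectorized replacement with Series.str.replace; the regex is an assumption about how the additive numbers look, and the sample meal names are made up:

```python
import pandas as pd

names = pd.Series(['Gulasch (2,3,8) mit Klößen', 'Salat'])
# remove optional whitespace plus a parenthesized group of digits and commas
cleaned = names.str.replace(r'\s*\([0-9,]+\)', '', regex=True)
print(cleaned.tolist())   # ['Gulasch mit Klößen', 'Salat']
```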
Solution:
data['name_backup'] = data['name']
%%timeit
# your solution without vectorized string operations
data['name'] = data['name_backup']
data = data.drop(columns=['name_backup'])
%%timeit
# your solution with vectorized string operations
Create a new column 'simple' from the 'name' column by removing all lower-case words, all punctuation marks
and so on. Only words starting with an upper-case letter are allowed.
Solution:
# your solution
Given a keyword (e.g., 'Kartoffel') get the number of meals containing the keyword and print all meal
descriptions.
Solution:
# your solution
For each day get the number of meals containing some keyword (e.g., 'Kartoffel'). Call Series.plot to
visualize the result.
Solution:
# your solution
Projects
Chapter 22
There are lots of different ways to install and use Python and related tools. Choose the one which suits your needs
and your way of working.
• Working with JupyterLab (page 285)
• Install Jupyter Locally (page 289)
• Python Without Jupyter (page 292)
• Long-Running Tasks (page 295)
22.1 Working with JupyterLab
In this project you learn basic usage of JupyterLab289, a web interface for Python programming. Read Python and
Jupyter (page 27) to get some information on the relation between Python and JupyterLab.
This project is available as video (with German audio and subtitles only):
Note: JupyterLab is not restricted to Python programming, but supports many other programming languages as well.
Task: Open a web browser and go to some JupyterLab provider. Students of Zwickau University should go to
Gauss290.
Hint: You may also run a local JupyterLab instance. See Install Jupyter Locally (page 289) project for installation
and start.
After starting JupyterLab you should see a file manager sidebar on the left and the launcher tab in the working area
on the right.
Task: Create a new notebook file by clicking the Python button in the launcher tab’s Notebook section.
This creates a file Untitled.ipynb in the current directory. To change the file name click ‘File’ and ‘Rename
Notebook…’ in the top menu.
289 https://jupyter.org
290 https://gauss.fh-zwickau.de
Fig. 22.1: File manager and launcher tab are shown in JupyterLab’s initial view.
a = 2
b = 3
print(a + b)
Fig. 22.2: Your notebook should look as depicted here if you’ve completed all tasks above.
22.1.3 Kernels
JupyterLab is the connection between you and Python. The code you write in a code cell is sent to Python for
execution. JupyterLab watches for outputs of your program and displays them to you. The background Python part
is referred to as the Python kernel.
The connection between a notebook in JupyterLab and the Python kernel is very loose. You may open a notebook
file without running a kernel. Then code execution isn’t possible. Or you may have a running Python kernel although
you closed your notebook.
Task: Close your notebook file’s tab. Then click the square symbol in the sidebar.
Fig. 22.3: The square button brings up a list of running kernels and some other information.
The Python kernel of our test notebook is still running. Clicking the kernel line reopens the notebook tab. The X
button (only shown on hover) shuts the kernel down.
Task: Shut down the kernel.
Hint: Instead of closing a tab and its kernel separately, you may also click 'File' and 'Close and Shutdown Notebook' in the menu.
Task: Switch to the file manager in the sidebar and open your notebook (double click its name).
Task: Execute the first (that is, topmost) cell.
Python complains about an unknown name. The fresh kernel hasn’t seen the a = 2 line up to now, because we did
not execute it. Executing the second cell makes the first work correctly.
Important: Always try to create notebooks which can be executed in the same order as cells appear in the notebook!
Hint: If your code takes too long to run or if it won’t stop for some reason, click ‘Kernel’ and ‘Interrupt Kernel’ in
the menu. This stops code execution and makes the kernel wait for new code.
Hint: To test your notebook’s behavior after launching a fresh kernel but without reopening the file, click ‘Kernel’
and ‘Restart Kernel…’ in the menu.
Up to now we only used code cells. Another important cell type is the Markdown cell. Markdown is a markup
language for writing formatted text.
# Some Heading
## A Subsection
Task: Create a new cell (ESC, then A or B to insert new cell above or below current cell). Switch cell type to
‘Markdown’, either via dropdown in toolbar or via ESC, then M.
Task: Write some Markdown code to the cell and execute the cell. To modify Markdown code double click the cell.
Then edit and execute again.
22.1.5 Terminals
JupyterLab allows you to run one or more terminals. A terminal is a text interface to the computer's operating system.
There you can use operating system features and programs not accessible through JupyterLab. Examples are copying
files or deleting non-empty directories.
Note: Almost all remote JupyterLab instances run on Linux machines, so in a terminal you have to use Linux
commands, not Windows commands. Linux, macOS, OpenBSD and many other operating systems share a common
set of commands. Only Windows has its own set.
Task: Open a terminal (click the Terminal button in the launcher tab’s Other section). Then type ls to get a list of
all files in the current directory. Then type logout to close the terminal.
The pwd command prints the current directory’s path. Commands for working with files are cd (change directory),
cp (copy), mv (move, rename), rm (remove), mkdir (create directory), rmdir (remove directory). See Unix
Commands291 for more commands and usage information.
291 https://en.wikibooks.org/wiki/Guide_to_Unix/Commands
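A few of these commands in action (the directory name is arbitrary):

```shell
mkdir demo_dir        # create a directory
cd demo_dir
pwd                   # print the current directory's path
cd ..
rmdir demo_dir        # remove the now-empty directory again
ls                    # list files in the current directory
```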
Hint: If you close a terminal tab without typing logout the terminal remains active in the background. To reopen
it or to shut it down click the ‘Running Terminals and Kernels’ button in the sidebar.
To quit JupyterLab click 'File' and 'Log Out'. Closing the browser tab without logging out may cause security
problems.
Before leaving JupyterLab shut down all kernels and terminals. This saves resources on the server. Most Jupyter
providers (including Gauss at Zwickau University) will shut down inactive kernels and JupyterLab sessions after
some hours automatically.
Note: It's possible to log out from JupyterLab while keeping a kernel running. Coming back some hours or days
later, one can fetch the outputs of long-running tasks. The corresponding workflow is described in the Long-Running
Tasks (page 295) project.
We want to set up an extensible system for Python development and data science in general, including Jupyter as one
component. Here we only install the base system. From time to time tools and Python libraries can be added on
demand.
Hint: A Python library is a collection of Python code files extending Python’s set of commands.
22.2.1 Conda
Task: Go to Miniconda Installer List295 and download a suitable installer for your system. Then follow the install
instructions296 for your system.
Conda is a command line tool. If you feel more comfortable with GUI tools, install Anaconda Navigator297 .
Task: Open a terminal and run conda install anaconda-navigator in it. This installs Anaconda
Navigator.
Depending on your system there should now be an entry for Anaconda Navigator in your system's app menu. If not,
add an entry manually. The Anaconda Navigator executable should be in the bin subdirectory of Miniconda's
installation directory. On non-Windows systems run which anaconda-navigator in a terminal to get the path.
Important: Before you install any additional tools with Anaconda Navigator or Conda, read on!
At the moment there is only one Python environment on your system, called base. Don’t install additional packages
to this environment. Create a separate environment for each kind of task, for instance, an environment you use to
work through projects and exercises in this book.
Task: Create a new Python environment ds-book. Either run conda create -n ds-book in a terminal or
go to ‘Environments’ page in Anaconda Navigator. Then click the plus button and follow the GUI instructions.
To switch between environments use Anaconda Navigator or run conda activate environment_name in
a terminal.
Note: Although you selected only one package for install, many more will be installed due to dependencies.
JupyterLab requires a number of other packages and those packages may require others again. Conda manages such
dependencies for us.
The Home page of Anaconda Navigator shows a launch button for JupyterLab. Make sure you have selected the
correct environment in the dropdown above the launch buttons.
Task: Launch JupyterLab via Anaconda Navigator or from the command line: jupyter lab or jupyter-lab.
Freshly installed JupyterLab lives in the environment you installed it in. Creating a new Python environment requires
installing JupyterLab in that environment, too (if you want to use JupyterLab there).
By default, JupyterLab shows your home directory and disallows visiting directories outside your home directory.
To access a different directory, run JupyterLab from a terminal. JupyterLab will then show the directory that was
active in the terminal at launch.
295 https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links
296 https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation
297 https://docs.anaconda.com/anaconda/navigator/
298 https://xkcd.com/1987
Fig. 22.4: The Python environmental protection agency wants to seal it in a cement chamber, with pictorial messages
to future civilizations warning them about the danger of using sudo to install random Python packages. Source:
Randall Munroe, xkcd.com/1987298
Fig. 22.5: The filter dropdown provides several filters for the package list.
There's already a basic Python installation in your environment, so you can use Python in JupyterLab. Additional
packages (math, visualization,…) can be installed on demand in the same way we installed JupyterLab. Always keep
an eye on the environment name when installing, so things end up in the correct environment.
Hint: Next to Conda there exist other package managers. A very prominent one is Pip299 . Conda automatically
installs Pip in each environment. To install a package with Pip simply write pip install package_name in
a terminal. Conda will take care of packages installed with Pip, too.
Some packages are only available for install with Conda, others only with Pip. So both package managers have to
be used in parallel.
Recently, a JupyterLab Desktop App300 has been released. This brings the look and feel of usual GUI apps to
JupyterLab’s start-up process. After start-up there’s no difference to browser based JupyterLab.
Handling of different Python environments is somewhat more difficult than with plain JupyterLab.
Jupyter is only one of many Python IDEs (integrated development environments). Although Jupyter is well suited
for data science applications, because text, visualizations and code can be mixed in one and the same document,
other tools may be appropriate, too.
Python (better: the Python interpreter) is a stand-alone program for running Python code on the command line. It
has an interactive mode, which executes each line of code immediately after writing. Alternatively, we may provide
a file containing Python code and the Python interpreter executes the file’s content.
Task: Open a terminal and type python (don’t forget to activate the correct environment with Conda or Anaconda
Navigator).
Now the Python interpreter runs in interactive mode. All Python commands are allowed. It’s, for instance, a powerful
replacement for a calculator.
Task: Type the following line by line
1 + 2 * 3
a = 2
b = 3
(a + b) ** 3
The result of each line is printed on screen immediately after hitting the return key.
To quit the interpreter we have to call the Python function for leaving a program.
Task: Type exit().
299 https://pypi.org/project/pip/
300 https://github.com/jupyterlab/jupyterlab-desktop
We may write Python code to a text file and hand it over to the Python interpreter for execution.
There are lots of text editors with additional features for coding, like syntax highlighting, automatic indentation, line
numbers. Two common ones are Kate301 (Linux, MacOS, Windows) and Notepad++302 (Windows). Others you
may hear about are Emacs303 , Vim304 and Nano305 , especially if you work on non-Windows machines (cloud!).
Fig. 22.6: Real programmers set the universal constants at the start such that the universe evolves to contain the disk
with the data they want. Source: Randall Munroe, xkcd.com/378306
code = None
if code == "":
    print("Too lazy to type?")
print("")
print("Bye")
Indentation matters! Python uses indentation (white space) to structure code. So don't modify the indentation depth
here.
Task: Open a terminal, go to your code file’s directory and run python bye.py.
The Python interpreter now runs your program. When reaching the end of the file, execution stops. Only output from
your program is shown. The Python interpreter itself doesn’t print anything as long as there are no errors in your
program.
301 https://kate-editor.org/
302 https://notepad-plus-plus.org
303 https://www.gnu.org/software/emacs/
304 https://www.vim.org/
305 https://www.nano-editor.org/
306 https://xkcd.com/378
22.3.3 Spyder
Spyder307 is a Python IDE for scientific programming with look and feel similar to Octave and Matlab. To use Spyder
install the spyder package with Anaconda Navigator or via conda install spyder in a terminal.
In addition to a text editor, an interactive Python interpreter and a separate plotting area, Spyder provides tools for
debugging (finding errors in programs) and for code profiling (measuring execution time and memory consumption).
Task: Run Spyder (click button in Anaconda Navigator or type spyder in terminal).
Task: Run the Spyder tour (Help > Show Tour).
Task: Open bye.py in Spyder. Run the program by clicking the ‘run file’ button in the toolbar (triangle symbol).
Fig. 22.7: Spyder after running bye.py. The variable inspector shows that now there is a variable code set in the
interactive interpreter.
22.3.4 Executables
Python code can be bundled together with the interpreter and all required libraries into one executable file. This is
not recommended in general, because the file will be relatively large, but it’s the only way to provide Python programs
to people who do not have a Python interpreter installed on their machine.
There are several tools for this job. A popular one is PyInstaller, which is available for the Anaconda distribution.
To convert your Python source code file into an executable file, open a terminal and type
Hint: Install the pyinstaller package via Anaconda Navigator or via conda install pyinstaller in
a terminal.
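A typical PyInstaller invocation for the bye.py file from above looks like this (a sketch; the --onefile option bundles everything into a single file, see pyinstaller --help for details):

```shell
pyinstaller --onefile bye.py
```

By default PyInstaller writes the resulting executable to the dist subdirectory.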
307 https://www.spyder-ide.org/
to be continued…
TWENTYTHREE
PYTHON PROGRAMMING
Programming is like riding a bike. If you want to learn it, you have to do it. Although nowadays there’s code available
for almost every standard task, implementing simple algorithms like searching and sorting ourselves is very instructive.
• Simple List Algorithms (page 297)
• Geometric Objects (page 300)
• Vector Multiplication (page 301)
In this project we implement simple algorithms related to lists, like sorting a list or finding special values. The purpose
of the project is threefold:
• familiarize yourself with Python’s syntax,
• learn to algorithmize, that is, how to combine available building blocks to solve a task,
• see and understand how basic algorithms frequently used in data science work.
Before you work through the project you should have read Building Blocks (page 46). Restrict yourself to Python
features discussed there. Don’t use ready-made library functions.
Important: Don’t use list as name for a variable holding some list, although this would be quite expressive.
Several names like print and int and list are already occupied by Python. Python won’t complain about
reusing some of its predefined names as variables, but it’s considered bad practice.
# your solution
Data Science and Artificial Intelligence for Undergraduates
4. Test your code with several different sample lists. Include pathological cases like [1, 1, 1] and [1].
Solution:
# your solution
Task: What happens if you test your code with an empty list? Now add some code to your function to check whether
the list is empty. If the list is empty get_max should print a message and return 0.
Solution:
# your solution
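For comparison, one possible sketch of get_max with the required empty-list handling (the exact message text is up to you; try your own version first):

```python
def get_max(values):
    # empty list: print a message and return 0, as required above
    if len(values) == 0:
        print("List is empty, returning 0.")
        return 0
    maximum = values[0]
    for value in values:
        if value > maximum:
            maximum = value
    return maximum
```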
# your solution
# your solution
Task: What happens if you test your code with an empty list? Now add some code to your function to check whether
the list is empty. If the list is empty get_mean should print a message and return 0.
Solution:
# your solution
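One possible sketch of get_mean with the required empty-list handling (compare with your own attempt):

```python
def get_mean(values):
    # empty list: print a message and return 0, as required above
    if len(values) == 0:
        print("List is empty, returning 0.")
        return 0
    total = 0
    for value in values:
        total = total + value
    return total / len(values)
```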
Given a list of integers we want to count how often a given integer occurs in the list.
Task: Write a function count_value taking two arguments (the list and an integer) and returning the number of
occurrences of the integer in the list. Proceed step by step as before. How to handle empty lists here?
Solution:
# your solution
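One possible sketch of count_value; note that an empty list needs no special treatment here, because the loop body simply never runs and 0 is returned:

```python
def count_value(values, number):
    # count how often number occurs in values
    count = 0
    for value in values:
        if value == number:
            count = count + 1
    return count
```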
There exist plenty of algorithms for sorting308 values in lists. Here we consider selection sort309 for sorting a list of
integers.
Fig. 23.1: Selection sort divides the list into two parts: sorted items and unsorted items. It repeatedly walks (blue)
through the unsorted items to find the smallest (red) unsorted item. Then it swaps the first unsorted item with the small-
est unsorted item. The swapped item then belongs to the sorted part (yellow). Source: Joestape89, wikipedia.org310,
CC BY-SA 3.0311, modified by the author.
Task: Write a function sort taking a list and returning the sorted list. Proceed step by step as before. Don’t forget
to extensively test your code!
Solution:
# your solution
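For reference, one possible selection sort sketch following the figure above (sorting in place and returning the list; write your own version before peeking):

```python
def sort(values):
    # selection sort: grow the sorted part from the left by repeatedly
    # swapping the smallest unsorted item to the front of the unsorted part
    for first_unsorted in range(len(values)):
        smallest = first_unsorted
        for index in range(first_unsorted + 1, len(values)):
            if values[index] < values[smallest]:
                smallest = index
        values[first_unsorted], values[smallest] = values[smallest], values[first_unsorted]
    return values
```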
308 https://en.wikipedia.org/wiki/Sorting_algorithm
309 https://en.wikipedia.org/wiki/Selection_sort
310 https://en.wikipedia.org/wiki/File:Selection-Sort-Animation.gif
311 https://creativecommons.org/licenses/by-sa/3.0/deed.en
To get a better feeling for the object-oriented approach to programming we implement classes for geometric objects,
like roughly sketched in Everything is an Object (page 59). How to draw lines can be guessed from Library Code
(page 57). Generally, you should have completed the Crash Course (page 43) chapter before starting with this project.
We’ll need Matplotlib for this project. So we should import it. In principle, imports can be placed anywhere in the
code, but it’s considered good practice to have them in the first lines of a file.
23.2.1 Points
Task: Create a class Point with two member variables x and y. To create a Point object we want to call
Point(3, 4), for instance.
Solution:
# your solution
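A minimal sketch of such a class:

```python
class Point:
    # a point in the plane with coordinates x and y
    def __init__(self, x, y):
        self.x = x
        self.y = y
```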
23.2.2 Triangles
Task: Create a class Triangle. For creating a Triangle object we want to pass three Point objects.
Solution:
# your solution
Task: Add a draw member function to Triangle which draws the triangle on screen using Matplotlib. Keep the
whole class definition in one code cell, because class definitions cannot be scattered over multiple cells.
Task: Create four points and draw two triangles between them (two points are used by both triangles). Remember
to call plt.show() to show the plot on screen. In Jupyter the plot shows up even without plt.show(), due to
some Jupyter magic, but in plain Python you won’t see the plot.
Solution:
# your solution
23.2.3 Rectangles
Task: Create a class Rectangle representing a paraxial (axis-parallel) rectangle. For creating a Rectangle object we want to
pass two Point objects, the lower left corner and the upper right corner. Each of the rectangle’s four points should
be accessible as a member variable.
Solution:
# your solution
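One possible sketch (the two remaining corners are derived from the given ones, so all four are member variables; the Point class from above is repeated to keep the example self-contained):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Rectangle:
    # lower_left and upper_right are Point objects;
    # the other two corners are derived from them
    def __init__(self, lower_left, upper_right):
        self.lower_left = lower_left
        self.upper_right = upper_right
        self.lower_right = Point(upper_right.x, lower_left.y)
        self.upper_left = Point(lower_left.x, upper_right.y)
```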
Task: Add a draw member function to Rectangle (use same code cell as before).
Task: Draw a series of 20 squares centered at the origin and growing from edge length 2 to 40. Call plt.
axis('equal') before plt.show() to make the squares look like squares.
Solution:
# your solution
23.2.4 Houses
Task: Create a class House representing a house made of a rectangular body and a triangular roof. The roof has
one third of the body’s height and is 20 per cent wider than the body. For creation we want to provide the lower left
corner as well as width of the body and total height of the house. Add a draw method.
Solution:
# your solution
# your solution
Task: Add a move method to Point which takes two numbers and moves the point paraxially by the specified
amounts. Then add move methods to Triangle, Rectangle and House, which call Point’s move method.
Task: Write a function row_of_houses for drawing a row of identical houses. Arguments are width and height
of the houses as well as start and end coordinates of the row on the x axis. Create one House object inside the
function. Draw and move the house in a while loop until the house leaves the specified interval.
Solution:
# your solution
# your solution
The aim of this project is to develop a class (object type) for 3-dimensional vectors. The class shall provide typical
calculations with vectors as Python operators. Before working through this project you should have read the Variables
and Operators (page 75) chapter. For a recap of vector operations see Vectors (page 333).
This project consolidates your knowledge on defining custom object types and demonstrates the possibilities of
Python’s flexible approach to customization of operator’s behavior.
We start with a minimal working example and then add more functionality step by step.
Task: Create a class Vector representing a vector in 3-dimensional space. The __init__ method takes the 3
components as arguments.
Solution:
# your solution
Task: Add dunder functions __str__ and __repr__ returning human readable representations of a Vector
object. Test your code by creating and printing a Vector object.
Solution:
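One possible sketch of the class so far, with __str__ and __repr__ (the exact text of the representation is up to you):

```python
class Vector:
    # a vector in 3-dimensional space
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __str__(self):
        # human readable representation used by print and str
        return "Vector({}, {}, {})".format(self.x, self.y, self.z)

    def __repr__(self):
        # used in interactive sessions and inside containers
        return self.__str__()
```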
23.3.2 Addition
When implementing operations on custom objects we have to decide whether to modify an existing object or to create
a new object holding the operation’s result. For vectors, a + b shouldn’t modify a, but return a new object.
Task: Implement vector addition with + operator. The operation should return a new object holding the result. If
the second operand is not of type Vector, addition should return NotImplemented. At this point we do not
need __radd__ because we only accept Vector objects for addition. So the left-hand side operand will always
be a Vector object and Python will call its __add__ method. Test your code!
Solution:
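A sketch of the addition part (only the members relevant here are shown; a new object is returned, the operands remain unchanged):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __add__(self, other):
        # only Vector + Vector is supported at this point
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)
```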
Task: Modify your implementation of addition to accept lists of length 3, too. Don’t forget to implement __radd__
now. Else list plus vector won’t work, because the left-hand side list object doesn’t know how to work with Vector
objects. Thus, Python tries to call __radd__ on the right-hand side operand. As always: test your code!
Solution:
Task: Implement multiplication of vectors by scalars via *. Both variants, scalar times vector and vector times scalar,
should be supported.
Solution:
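A sketch of scalar multiplication (again only the relevant members are shown; reusing __mul__ as __rmul__ covers scalar times vector):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __mul__(self, scalar):
        # vector * scalar
        if not isinstance(scalar, (int, float)):
            return NotImplemented
        return Vector(scalar * self.x, scalar * self.y, scalar * self.z)

    # scalar * vector: Python calls __rmul__ on the right-hand operand
    __rmul__ = __mul__
```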
Now we run out of operators. Of course, we could use * again for inner products and check types to decide whether we
have to do multiplication by scalars or to compute an inner product. The more readable alternative is implementing
inner products without an operator.
Task: Implement a method inner taking a Vector or a list object as argument and returning the inner product
with the vector whose inner method has been called. If the argument is neither a vector nor a 3-element list, return
NotImplemented, although the Python interpreter does not care here (because inner is a regular method, not an
operator). If you know about Python’s exception handling mechanism, you should raise NotImplementedError
here.
Solution:
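A sketch of the inner method (lists of length 3 are converted to a Vector first, everything else yields NotImplemented as described above):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def inner(self, other):
        # accept a 3-element list by converting it to a Vector first
        if isinstance(other, list) and len(other) == 3:
            other = Vector(other[0], other[1], other[2])
        if not isinstance(other, Vector):
            return NotImplemented
        return self.x * other.x + self.y * other.y + self.z * other.z
```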
Task: In analogy to inner implement a method outer returning the outer product of two vectors.
Solution:
23.3.6 Equality
At the moment == behaves like is. But we want to make == compare vectors componentwise.
Task: Implement equality test via == between two Vector objects and a vector and a list. If the second operand is
of incorrect type both operands are considered unequal.
Solution:
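A sketch of componentwise equality (note that list == vector also works, because the list returns NotImplemented and Python then tries the Vector’s __eq__):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __eq__(self, other):
        # componentwise comparison; 3-element lists are allowed, too
        if isinstance(other, list) and len(other) == 3:
            other = Vector(other[0], other[1], other[2])
        if not isinstance(other, Vector):
            return False
        return self.x == other.x and self.y == other.y and self.z == other.z
```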
There’s a dunder method __getitem__ which is called by the Python interpreter whenever indexing syntax [...]
is applied to an object. The index is passed to the method and the method is expected to return the corresponding
item.
Task: Implement __getitem__. For invalid indices return None. If you know about Python’s exception handling
mechanism, you should raise IndexError here.
Solution:
Task: Make the len function work on Vector objects. Return value should be 3.
Solution:
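A sketch covering both tasks, __getitem__ (returning None for invalid indices, as specified above) and __len__ (which makes the len function work):

```python
class Vector:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __getitem__(self, index):
        # valid indices are 0, 1, 2; None for everything else
        if index == 0:
            return self.x
        if index == 1:
            return self.y
        if index == 2:
            return self.z
        return None

    def __len__(self):
        # len(v) calls v.__len__()
        return 3
```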
TWENTYFOUR
WEATHER
We will work through several projects related to weather data and forecasting. Data will be obtained from Deutscher
Wetterdienst312 .
• DWD Open Data Portal (page 305)
• Getting Forecasts (page 307)
• Climate Change (page 309)
Deutscher Wetterdienst (DWD)313 is Germany’s public authority for collecting, managing and publishing weather
data from around the world. DWD also creates weather forecasts for Germany and all other regions of the world.
Some years ago DWD launched an Open Data Portal314 and continually extends its services there.
DWD’s open data portal provides lots of data and is very complex. In this project we explore part of its structure and
locate data sources for subsequent projects.
24.1.1 Licensing
We want to use DWD’s data for education and research. Before we delve into the data we should check whether we
are allowed to do this and whether and how to attribute the source.
Task: Find out whether we are allowed to use DWD’s data for education and research purposes.
Solution:
# your answer
# your answer
312 https://www.dwd.de
313 https://www.dwd.de
314 https://www.dwd.de/opendata
There are lots of weather stations in Germany collecting weather data. To locate weather data on the map, we need
a list of stations and their geolocations.
Task: Find a list of all DWD weather stations in Germany measuring air temperature at least once per hour. The list
should contain geolocations (longitude, latitude, altitude) and other parameters. Get the URL, so we can download
it on demand.
Solution:
# your answer
Task: How many weather stations do we have in the list? List all parameters available for the stations.
Solution:
# your answer
Task: Get the station list for hourly precipitation measurements. How many stations do we have here?
Solution:
# your answer
24.1.3 Data
For each weather station we want to have access to all its historical and most recent measurements.
Task: Locate hourly temperature measurements for station Lichtentanne315. What’s the most recent measurement
(timestamp and temperature)? What’s the oldest measurement available?
Solution:
# your answer
24.1.4 Metadata
Each station comes with extensive metadata telling a story about the station and its measurements.
Task: Answer the following questions from metadata of Lichtentanne station:
• Did the station move? If yes, when? Was its name changed, too?
• Has measurement equipment been replaced? If yes, when?
• What’s the time zone of timestamps in the data?
Solution:
# your answer
The Open Data Portal316 of Deutscher Wetterdienst (DWD)317 provides detailed forecasts for Germany and all other
regions of the world in human and machine readable form. The machine readable service is called MOSMIX318 . In
this project we
• collect information on how to use MOSMIX,
• automatically download newly published MOSMIX data,
• convert MOSMIX files to CSV files.
In this project we heavily rely on techniques presented in Accessing Data (page 111).
DWD’s open data portal is quite complex. Before we start downloading forecast data we have to find information
on data location and format.
Task: Read about MOSMIX at DWD’s MOSMIX info page319 . Follow relevant links and answer the following
questions:
• What are the differences between MOSMIX S and MOSMIX L?
• What’s the URL of the most recent MOSMIX L file for station ‘Zwickau’?
• What standard file formats are used for MOSMIX files (KMZ files)?
• How long are MOSMIX files available at DWD’s open data portal?
Solution:
# your answers
MOSMIX data older than two days gets removed from DWD’s open data portal. To be able to analyze the quality of
forecasts (that is, to compare them to real observations) we have to keep them in a local archive. For this purpose we
would have to visit DWD’s open data portal once a day and look for new MOSMIX files. Then we could download
them and add them to our local archive. With Python we may automate this job.
Task: Write a function get_available_mosmix_files which scrapes a list of URLs of all currently available
MOSMIX L files for a selected station from DWD open data portal. Arguments:
• station ID (string).
Return value:
• URLs (list of strings).
Solution:
# your solution
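A sketch of get_available_mosmix_files; the base URL and the file name pattern MOSMIX_L_<runtime>_<station>.kmz are assumptions to be verified against the portal’s actual directory listing:

```python
import re
import urllib.request

# assumed base URL of the single-station MOSMIX L listings;
# check this against the portal's actual directory structure
BASE_URL = ("https://opendata.dwd.de/weather/local_forecasts/mos/"
            "MOSMIX_L/single_stations/")


def parse_mosmix_listing(html, station_id):
    # extract file names like MOSMIX_L_2024010109_10577.kmz for one station
    pattern = r'href="(MOSMIX_L_\d+_' + re.escape(station_id) + r'\.kmz)"'
    return re.findall(pattern, html)


def get_available_mosmix_files(station_id):
    # download the directory listing and return full URLs (list of strings)
    url = BASE_URL + station_id + "/kml/"
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8")
    return [url + name for name in parse_mosmix_listing(html, station_id)]
```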
Now it’s time to download the files. Maybe we already downloaded some of them yesterday. So we should have a
look in our archive directory first to avoid downloading more files than necessary.
Task: Write a function download_files which downloads all new files from a list of URLs. Arguments:
316 https://www.dwd.de/opendata
317 https://www.dwd.de
318 https://www.dwd.de/EN/ourservices/met_application_mosmix/met_application_mosmix.html
319 https://www.dwd.de/EN/ourservices/met_application_mosmix/met_application_mosmix.html
# your solution
Now that we have MOSMIX files in our local storage we should convert them to CSV files. Each row shall contain
all weather parameters for a fixed point in time. The first column is the time stamp; all other columns contain the
weather parameters from the MOSMIX files.
Task: Write a function kmz_to_csv for converting a list of KMZ files to CSV files. Arguments:
• archive path (string),
• list of file names (list of strings).
No return value.
Hint: MOSMIX files use an XML feature known as namespaces. Consequently, tag names contain colons, which
confuses Beautiful Soup’s standard HTML parser (which also parses simple XML files). To get MOSMIX files parsed
correctly, install the lxml module and provide a second argument 'xml' to Beautiful Soup’s constructor. This tells
Beautiful Soup to use a dedicated XML parser, which by default is lxml.
Solution:
# your solution
To collect forecasts over a longer period of time we have to run the developed code once per day. We could implement
a loop and use time.sleep to make Python wait one day before continuing with the next run. The better (simpler
and more efficient) solution is to tell the operating system to run the Python program each day at a fixed time.
On Linux and macOS there is cron (and anacron) for scheduling tasks. On Windows there is the Task Scheduler.
Task: Find out the details about scheduling a daily task on your system. Then make a Python script file from your
code above and let it run once per day.
Solution:
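On Linux, a crontab entry (edit with crontab -e) along the following lines runs a script every day at 6:00; the interpreter and script paths are placeholders and must be adapted to your system:

```
0 6 * * * /usr/bin/python3 /home/user/mosmix_download.py
```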
320 https://docs.python.org/3/library/os.path.html#os.path.isfile
In this project we download historic weather data from DWD Open Data Portal321 and have a look at annual mean
temperatures and other values at different locations in Germany.
In this project we heavily rely on techniques presented in Accessing Data (page 111) and High-Level Data Management
with Pandas (page 187) as well as on knowledge obtained in the DWD Open Data Portal (page 305) project.
We use the DWD data set Historical daily station observations for Germany322 , see description323 .
# your solution
Task: Get a list of file names of all ZIP files of the data set.
Hint: A good idea is to construct file names from data in the station list (ID, first and last day of measurement). But
it turns out that the dates in the station list and in the file names do not coincide for several files. Thus, we have to
scrape the file names from the data set’s file listing326.
Solution:
# your solution
climate_daily_kl_historical_en.pdf
324 https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/KL_Tageswerte_Beschreibung_
Stationen.txt
325 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html
326 https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/daily/kl/historical/
4. Write data to a CSV file (one large CSV file for data from all stations).
Columns for CSV file:
• date (timestamp of measurement),
• id (station ID, integer),
• 'wind_gust', 'wind_speed', 'precipitation', 'sunshine', 'snow',
'clouds', 'pressure', 'temperature', 'humidity', 'max_temp', 'min_temp',
'min_temp_ground' (float).
Solution:
# your solution
Dates of first and last measurements are incorrect in the station list created above. Now that we have the measurements,
we should correct the list.
Task: For each station get dates for first and last measurement and write them to the station list CSV file. Drop all
stations that do not have any measurements.
# your solution
24.3.4 Plots
# your solution
TWENTYFIVE
A major application of data science and artificial intelligence is recognition of handwritten characters. In a series of
projects we will implement different techniques for this task based on the famous MNIST data set (and related data
sets) for training recognition systems. MNIST is provided by the National Institute of Standards and Technology327.
• The xMNIST Family of Data Sets (page 311)
• Load QMNIST (page 313)
In this project we have a first look at the MNIST data set and related data sets. In subsequent projects we’ll use these
data sets for training machine learning models.
A major benefit of the project is that we see how difficult data preparation can be. As we’ll learn later on, obtaining
unbiased data is extremely important for training machine learning algorithms.
# your answers
327 https://www.nist.gov
328 https://www.nist.gov/system/files/documents/srd/nistsd19.pdf
329 https://www.nist.gov/srd/nist-special-database-19
25.1.2 MNIST
# your answers
25.1.3 QMNIST
# your answers
25.1.4 EMNIST
# your answers
In this project we develop a Python module for loading and preprocessing QMNIST images and metadata. Prerequisites:
• Efficient Computations with NumPy (page 155)
• The xMNIST Family of Data Sets (page 311)
Task: Get QMNIST training and test data from QMNIST GitHub repository336 (4 files ending with ...
idx3-ubyte.gz or ...idx2-int.gz) and find information on the file format.
Task: Write a function load which reads images and metadata from the QMNIST files. Parameters:
• path: defaulting to '', path of directory with data files.
• subset: defaulting to 'train' (load training data), passing 'test' loads test data.
• as_list: defaulting to False (return one large array), passing True returns a list of images.
Return values:
• NumPy array of shape (60000, 28, 28) or list of 60000 NumPy arrays of shape (28, 28) (range
0…1, type float16), depending on parameter as_list.
• NumPy array of shape (60000, ) containing classes (type uint8).
• NumPy array of shape (60000, ) containing series IDs (type uint8).
• NumPy array of shape (60000, ) containing writer IDs (type uint16).
334 https://arxiv.org/pdf/1702.05373.pdf
335 https://www.nist.gov/itl/products-and-services/emnist-dataset
336 https://github.com/facebookresearch/qmnist
Test your function and show first and last images of training and test data. Print corresponding metainformation. You
may use the code from Image Processing with NumPy (page 271) to show images.
Hint: Going the obvious path via zipfile module and np.fromfile fails due to two problems:
1. Python’s zipfile module has some trouble reading the QMNIST files. Try the gzip module337 from
Python’s standard library instead.
2. NumPy’s fromfile is not compatible with file objects created by the gzip module. The fromfile
function will read compressed instead of uncompressed data (for some very knotty technical reasons). Thus,
read with the file object’s read method and use np.frombuffer.
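The hint can be sketched as follows; the header layout (four big-endian uint32 values: magic number, image count, rows, columns) is the documented IDX3 format, and the scaling to range 0…1 float16 matches the specification above:

```python
import gzip

import numpy as np


def read_idx3_images(path):
    # gzip.open instead of zipfile, read() plus np.frombuffer
    # instead of np.fromfile, as discussed in the hint
    with gzip.open(path, "rb") as f:
        data = f.read()
    # header: magic number, image count, rows, columns (big-endian uint32)
    magic, count, rows, cols = np.frombuffer(data[:16], dtype=">u4")
    images = np.frombuffer(data[16:], dtype=np.uint8)
    images = images.reshape(count, rows, cols)
    # scale pixel values to range 0...1 and convert to float16
    return images.astype(np.float16) / 255
```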
Solution:
# your solution
25.2.2 Preprocessing
Before images can be used preprocessing steps might be appropriate. Given a list of preprocessing steps we would
like to have a function which applies all the steps to all images.
Task: Write a function preprocess which applies a list of preprocessing steps to all images. Parameters:
• images: large NumPy array or list of arrays (images to be processed).
• steps: list of functions; each function takes an image and returns an image.
• as_list: False (default) returns images in large array (and fails if image sizes differ after applying pre-
processing steps); True returns list of images.
Return values:
• list of processed images or large array of images, depending on parameter as_list.
Test your code with two preprocessing steps:
1. horizontal mirroring,
2. color inversion (black to white, white to black).
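One possible sketch including the two test steps (np.stack is one way to rebuild the large array; it raises an error for differing shapes, which matches the required behavior):

```python
import numpy as np


def preprocess(images, steps, as_list=False):
    # apply every preprocessing step to every image, in the given order
    processed = []
    for image in images:
        for step in steps:
            image = step(image)
        processed.append(image)
    if as_list:
        return processed
    # fails if image sizes differ after preprocessing, as intended
    return np.stack(processed)


# the two test steps: horizontal mirroring and color inversion
def mirror(image):
    return image[:, ::-1]


def invert(image):
    return 1.0 - image
```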
Solution:
# your solution
Task: Create a Python module qmnist.py providing both functions load and preprocess.
Solution:
# your solution
337 https://docs.python.org/3/library/gzip.html
TWENTYSIX
CAFETERIA
Have a look at the Zwickau and Chemnitz Universities’ menu338 (the cafeterias of both universities are operated by
Studentenwerk Chemnitz-Zwickau339). In this project we want to scrape as much historic menu data as possible
from that website. Read Accessing Data (page 111) before you start. Section Web Access (page 122) is of particular
importance.
Often web APIs come with some documentation. In our case we neither see an obvious API nor any documenta-
tion. Clicking through the menus of past weeks and watching the browser’s address bar we see how date and other
information is encoded in the URL. This is our key for scraping historic data.
In addition, there is a link on the lower right looking like information about the API. It turns out that there is not
much API-related information, but a useful hint on an XML interface340 using the same parameter encoding as
the HTML interface.
Task: Understand the arguments in the HTML URLs. Then try the XML API from your browser’s address bar.
Note all location IDs (for ‘Mensa Ring’ and so on) and the oldest available menu (by trial and error).
Solution:
# your answer
Have a look at the license information341. There we read that it’s okay to use the data for our intended purposes.
Remember not to fire too many requests at the server in a short time! This may trigger some protection mechanism
making the server refuse any communication with us.
• Limit the number of requests per second by pausing your script after each request.
• While developing and testing automatic download, limit the total number of requests to a handful until you’re
certain that your script works correctly.
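Both rules can be wrapped in a small helper function; fetch is a hypothetical placeholder for the actual download function (built on urllib or requests, for instance) and the default values are assumptions:

```python
import time


def polite_download(urls, fetch, delay=1.0, max_requests=None):
    # fetch: function taking a URL and returning the downloaded content;
    # delay: pause in seconds after each request;
    # max_requests: cap for the testing phase (None means no cap)
    results = {}
    for number, url in enumerate(urls):
        if max_requests is not None and number >= max_requests:
            break
        results[url] = fetch(url)
        time.sleep(delay)
    return results
```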
338 https://www.swcz.de/bilderspeiseplan
339 https://www.swcz.de
340 https://www.swcz.de/bilderspeiseplan/xml.php
341 https://www.swcz.de/bilderspeiseplan/lizenz.php
# your solution
26.4 Parsing
Task: From all the downloaded files extract all meals including date, category, description, and prices for students,
staff, guests. Save the data to a CSV file.
Solution:
# your solution
TWENTYSEVEN
PUBLIC TRANSPORT
In this series of projects we visualize and analyze public transport networks based on open data.
• Get Data and Set Up the Environment (page 317)
• Find Connections (page 322)
In this project we download public transport data and install several Python packages for its processing. Some basic
knowledge in Python programming is required for this project.
Timetable data for public transport operators in Germany is available in GTFS format342 .
Task: Go to gtfs.de343 . Find available GTFS feeds. What types of transport are contained in each feed? What time
periods are covered by the data? Are we allowed to use the data?
Solution:
# your answers
Task: Download all available data from gtfs.de344 . Note download URLs and terminal commands (if you use the
terminal).
Solution:
# your notes
342 https://en.wikipedia.org/wiki/GTFS
343 https://www.gtfs.de
344 https://www.gtfs.de
To compute walking distances between neighboring public transport stops we’ll use data from OpenStreetMap
(OSM)345. The OSM website provides downloads of (too) small regions or of the whole planet (about 60 GB).
Geofabrik GmbH346 provides regional downloads.
Task: Check OSM licence information. Then download OSM data for Europe in PBF format (Germany is not
enough, because GTFS data may contain stops in neighboring countries, if German trains cross borders). Note the
download URL and terminal commands.
Solution:
# your notes
Extracting walking distances from OSM data requires a lot of memory. Memory consumption grows with size of the
region under consideration. Thus, we should extract our region of interest from Europe’s OSM file.
Task: Find minimum and maximum latitude and longitude of your region of interest (go to OSM and look at the
coordinates of some object on the border of your region of interest).
Solution:
# your answer
There exist many tools for processing OSM data. A very handy one is Osmosis347. You may use it as a Python package
or in the terminal. The terminal command for data extraction is
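a command of roughly the following shape (a sketch; file names and the bounding-box coordinates of your region of interest are placeholders):

```
osmosis --read-pbf europe-latest.osm.pbf \
        --bounding-box top=51.0 left=12.0 bottom=50.0 right=13.0 \
        --write-pbf region.osm.pbf
```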
Task: Extract your region of interest with Osmosis. Note the full terminal command.
Solution:
# your notes
We want to use the gtfspy348 Python package. It has been unmaintained since 2019 (at least); thus, installation is tricky
due to outdated dependencies. But it’s a nice package including fast public transport routing. It has been developed for
creating A collection of public transport network data sets for 25 cities349 (also see the corresponding GitHub repo350).
To avoid messing up your everyday Conda environment with failed installations and broken dependencies create a
new Conda environment for this project.
Task: Create a new Conda environment gtfs. If working on Gauss351 , don’t forget to create a corresponding
ipykernel for Jupyter and to switch your notebook’s kernel to the new one.
Solution:
345 https://www.osm.org
346 http://www.geofabrik.de/
347 https://wiki.openstreetmap.org/wiki/Osmosis
348 https://github.com/CxAalto/gtfspy
349 https://www.nature.com/articles/sdata201889
350 https://github.com/CxAalto/gtfs_data_pipeline
351 https://gauss.fh-zwickau.de
# your notes
The gtfspy package depends on the osmread352 package. But osmread isn’t available via Conda. Via PyPI (that
is, pip) we get an older version with outdated (unsatisfiable) dependencies. Thus, we have to install osmread from
source.
Task: Find out what the following commands do. For each line write a short comment. Then run the commands
(works on Linux, macOS and Co.; for Windows minor modifications may be required).
Solution:
# your notes
The gtfspy package comes with outdated dependencies and several programming errors. Thus, we install it from
source as a local package in our working directory. This way we may easily fix issues when they pop up.
Task: Find out what the following commands do. Why do we need the mv commands? For each line write a short
comment. Then run the commands (works on Linux, macOS and Co.; for Windows minor modifications may be
required).
pip install pandas networkx pyshp nose Cython shapely pyproj mopy geoindex geojson matplotlib-scalebar
Solution:
# your notes
The gtfspy package uses several outdated library functions (mainly from networkx package) and contains some
programming errors. Some patching is in order…
Task: Implement the modifications listed below and think about why they could be necessary (make short notes).
Solution:
# your notes
352 https://github.com/dezhin/osmread
in gtfspy/osm_tranfer.py:
• replace (line 91)
network_nodes = walk_network.nodes(data="true")
by
network_nodes = walk_network.nodes(data=True)
• replace
walk_network.add_path(way.nodes)
by
networkx.add_path(walk_network, way.nodes)
• replace the node cleanup loop (original lines not shown here) by
nodes_to_remove = []
good_nodes = networkx.get_node_attributes(walk_network, 'lat').keys()
for node, degree in walk_network.degree():
if degree == 0:
nodes_to_remove.append(node)
elif node not in good_nodes:
nodes_to_remove.append(node)
for node in nodes_to_remove:
walk_network.remove_node(node)
(good_nodes contains all nodes with lat/lon data; nodes without data presumably belong to ways crossing the
map’s border (some nodes dropped by Osmosis, but way not shortened); prevents index errors when computing
edge lengths some lines below)
in gtfspy/networks.py:
• replace (lines 267-270):
events_df.drop('to_seq', 1, inplace=True)
events_df.drop('shape_id', 1, inplace=True)
events_df.drop('duration', 1, inplace=True)
events_df.drop('route_id', 1, inplace=True)
by
events_df.drop('to_seq', axis=1, inplace=True)
events_df.drop('shape_id', axis=1, inplace=True)
events_df.drop('duration', axis=1, inplace=True)
events_df.drop('route_id', axis=1, inplace=True)
(newer Pandas versions require the axis argument to be passed by keyword)
To speed up routing gtfspy stores all data in an SQLite353 data base. That's a regular file with extension .sqlite. The first step in working with gtfspy is to create the data base containing all relevant GTFS feeds.
Task: Have a look at the import_gtfs function in gtfspy’s import_gtfs module. Use this function to transfer GTFS feeds of interest to you to an SQLite data base.
Solution:
# your solution
If imported GTFS data covers a much larger region than the region you are interested in, you should filter the created
data base by region. Else, routing becomes too expensive (in terms of computation time). The gtfspy package
provides such filtering, but it’s expensive, too. Thus, filtering should only be used if it reduces the data base’s size
significantly.
Filtering requires three steps:
1. Open the data base to filter by creating a GTFS object, defined in gtfspy’s gtfs module.
2. Create a FilterExtract object, defined in gtfspy’s filter module.
3. Call the FilterExtract object’s filter method.
Task: Have a look at gtfspy’s source to learn how to use the above-mentioned objects and functions. Then filter the
data base by region (hint: ‘buffer zone’ in gtfspy's source is the region of interest).
Solution:
# your solution
To get more realistic walking times between neighboring stops we may extract walking distances from OpenStreetMap. This step is optional. It requires a lot of memory and computation time, because the whole walk network (all walkable paths and streets) is extracted from the OSM file. Use OSM walking distances for small regions only. Without OSM data Euclidean distances are used.
Task: Have a look at add_walk_distances_to_db_python in gtfspy’s osm_transfer module. Then use this function to get OSM walking distances. If your region is too large, have a look at the hint below this task.
Solution:
# your solution
353 https://www.sqlite.org
Hint: Without OSM walking distances the routing algorithm will complain about the missing key d_walk in a dictionary. That’s presumably a bug. Workaround: whenever you use your data base (without OSM distances) for routing, add the following lines to your code:
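A plausible reconstruction of the workaround (the edge attribute names 'd' and 'd_walk' are assumptions inferred from the error message; the toy graph only stands in for the real walk network):

```python
import networkx

# toy walk network; in the project this object comes from get_walk_network
walk_network = networkx.Graph()
walk_network.add_edge(1, 2, d=120.0)   # 'd': Euclidean distance in meters
walk_network.add_edge(2, 3, d=80.0)

# copy the Euclidean distance into the key the routing algorithm expects
for source, target, data in walk_network.edges(data=True):
    data["d_walk"] = data["d"]
```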
Here walk_network is an object representing the walk network stored in the data base. It will be created as
preparative step for routing and then passed to the routing algorithm. Place the code between creation of the walk
network and passing the walk network to the routing algorithm.
If you use these two lines of code with OSM distances in the data base, the OSM distances will be overwritten with Euclidean distances.
To use the SQLite data base we have to create a GTFS object, defined in gtfspy’s gtfs module. This object then provides lots of methods for accessing the data.
Task: Have a look at a GTFS object’s stops, get_min_date, and get_max_date methods. Call them to get a list of all stops and the date range covered by the GTFS data.
Solution:
# your solution
In this project we generate departure times for all stops in a region of interest for connections to one arrival stop with
fixed (latest) arrival time.
The project uses the gtfspy data base created in the Get Data and Set Up the Environment (page 317) project.
Basic Pandas knowledge is required to solve the tasks (read Series (page 188), Data Frames (page 198), Advanced
Indexing (page 207) before you start, Performance Issues (page 233) may be of interest, too).
Task: Connect to the data base, that is, create a gtfspy.gtfs.GTFS object.
Solution:
# your solution
The routing algorithm of gtfspy looks for public transport connections in a user-defined time frame. Start and end
time have to be provided in Unix time354 .
Task: Compute Unix times for start and end of your time frame of interest. Use the GTFS object’s get_day_start_ut method to convert a date to its 00:00 Unix time. Then add hours and minutes to this value.
Hint: The Python standard library provides functions for getting Unix times. But GTFS.get_day_start_ut
takes care of time zone information in the GTFS data.
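As a plain-Python cross-check (date and times here are made up; this ignores the feed's time zone handling that get_day_start_ut provides):

```python
from datetime import datetime, timezone

# Unix time for 2023-05-01 08:30 UTC, computed with the standard library
day_start = datetime(2023, 5, 1, tzinfo=timezone.utc).timestamp()
start_ut = day_start + 8 * 3600 + 30 * 60
print(int(start_ut))  # → 1682929800
```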
Solution:
354 https://en.wikipedia.org/wiki/Unix_time
# your solution
The routing algorithm of gtfspy computes public transport connections from all stops in the data base to a user-defined arrival stop. The arrival stop has to be specified by its GTFS ID (column 'stop_I' in the data frame returned by GTFS.stops()).
Task: Get the stops data frame. Use column 'stop_I' (GTFS stop ID) as index. Rename the index column
to 'id' and the column 'stop_id' to 'code' (the stop’s GTFS short name). Drop all columns but 'id',
'code', 'name', 'lat', 'lon'.
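A sketch with a hypothetical two-row stand-in for the frame returned by GTFS.stops() (column set and values are made up except for those named in the task):

```python
import pandas as pd

# hypothetical stand-in for the data frame returned by GTFS.stops()
stops = pd.DataFrame({
    "stop_I": [1, 2],
    "stop_id": ["ZW1", "ZW2"],
    "name": ["Zwickau, Zentrum", "Zwickau, Hbf"],
    "lat": [50.718, 50.726],
    "lon": [12.494, 12.481],
    "parent_I": [-1, -1],  # one of the columns to drop
})

stops = (
    stops.set_index("stop_I")
    .rename_axis("id")                      # index column 'stop_I' -> 'id'
    .rename(columns={"stop_id": "code"})    # the stop's GTFS short name
    [["code", "name", "lat", "lon"]]        # keep only these columns
)
```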
Solution:
# your solution
Task: Write some code to find all stops whose name contains a given string (e.g., all stops containing 'Zwickau, Zentrum'). Use the stops’ geolocation and OpenStreetMap to decide on an arrival stop.
Hint: An advanced and very convenient solution is to generate, for each relevant stop, a link to OSM (with a marker at the stop). Rendering these links as HTML in Jupyter, you simply have to click a stop’s link to see where it is on the map.
• OSM link with marker: https://www.osm.org/?mlat=MARKER_LAT&mlon=MARKER_LON
• HTML rendering for links:
from IPython.display import display, HTML
display(HTML('<a href="URL">LINK_TEXT</a>'))
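Combining both hints, a sketch (stop names and coordinates are made up; rendering the links additionally requires IPython.display inside Jupyter):

```python
import pandas as pd

stops = pd.DataFrame({
    "name": ["Zwickau, Zentrum", "Zwickau, Hbf", "Leipzig, Markt"],
    "lat": [50.718, 50.726, 51.340],
    "lon": [12.494, 12.481, 12.375],
})

# all stops whose name contains the search string
hits = stops[stops["name"].str.contains("Zwickau, Zentrum")]

# build one OSM link (with marker) per matching stop
links = [
    f'<a href="https://www.osm.org/?mlat={row.lat}&mlon={row.lon}">{row.name}</a>'
    for row in hits.itertuples()
]
# in Jupyter: display(HTML("<br>".join(links)))
```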
Solution:
# your solution
27.2.3 Routing
The routing API of gtfspy is relatively complex and unintuitive. To generate all connections to the arrival stop the following steps are necessary:
1. Call gtfspy.routing.helpers.get_transit_connections.
2. Call gtfspy.routing.helpers.get_walk_network(G, max_walk).
3. Create a gtfspy.routing.multi_objective_pseudo_connection_scan_profiler.MultiObjectivePseudoCSAProfiler object. Pass the results of steps 1 and 2 to the constructor
(arguments transit_events and walk_network).
4. Call the run method of the object created in step 3.
Task: Follow the above steps. Have a look at gtfspy’s source for available arguments. A good walking speed is 1.5 (presumably meters per second). With track_vehicle_legs and track_time you (presumably) can influence whether connections with fewer transfers or with lower travel time are preferred by the routing algorithm.
Solution:
# your solution
The MultiObjectivePseudoCSAProfiler object now contains information about all connections to the
arrival stop in the specified time frame. The stop_profiles member variable is subscriptable with allowed
indices returned by the keys member function. Indices are stop IDs. If i is a stop ID, then stop_profiles[i].
get_final_optimal_labels() returns an iterable object with one item per connection from stop i to the
arrival stop. Each item has a departure_time member containing the departure time of the connection in Unix
time.
Task: Add a column to your stops data frame, which contains the difference between latest allowed arrival time and
latest possible departure time from the considered stop in minutes. For stops without connection to the arrival stop
use -1.
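A sketch of the bookkeeping with made-up numbers (in the project the latest departure times come from stop_profiles[i].get_final_optimal_labels(); here they are hard-coded, with None marking a stop without connection):

```python
import pandas as pd

# hypothetical latest departure times (Unix seconds) per stop; None = no connection
latest_departure = {101: 1682928000, 102: None, 103: 1682926800}
arrival_ut = 1682929800  # latest allowed arrival time

stops = pd.DataFrame({"name": ["A", "B", "C"]}, index=[101, 102, 103])
stops["minutes_before_arrival"] = [
    -1 if latest_departure[i] is None else (arrival_ut - latest_departure[i]) // 60
    for i in stops.index
]
print(stops["minutes_before_arrival"].tolist())  # → [30, -1, 50]
```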
Solution:
# your solution
In the stops data frame most stops appear multiple times, e.g., each platform of a station has its own item in the data frame. For visualization nearby stops should be merged into one stop. The GTFS object’s get_stops_within_distance method yields a data frame of nearby stops. The first argument is the considered stop’s ID, the second argument is the distance in meters.
Task: Think about an algorithm for grouping stops and implement it. Add a column to your stops data frame, which
contains a group ID for each stop. All stops with identical group ID are considered one and the same stop (in the
visualization to create in a follow-up project).
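One possible algorithm is union-find over "nearby" pairs: every pair of stops within the chosen distance is merged into the same group. The pair list below is made up; in the project it would come from get_stops_within_distance:

```python
# union-find: stops connected by a "nearby" relation end up in one group
parent = {}

def find(x):
    """Return the group representative of stop x (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    """Merge the groups of stops x and y."""
    parent[find(x)] = find(y)

stop_ids = [1, 2, 3, 4, 5]
nearby_pairs = [(1, 2), (2, 3), (4, 5)]  # hypothetical "within 100 m" pairs

for a, b in nearby_pairs:
    union(a, b)

# group ID per stop: the representative returned by find
groups = {s: find(s) for s in stop_ids}
print(groups)
```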
Solution:
# your solution
Task: How many stop groups do you have? What’s the largest group? Show all its stops.
Solution:
# your solution
TWENTYEIGHT
CORONA DEATHS
In this project we collect and/or compute death rates before and during the Corona pandemic in Germany. You should
read Dates and Times (page 217) before you start.
We would like to have monthly death rates for as long a period of time as possible, including very recent data.
Task: Download relevant data from Federal Statistical Office (Statistisches Bundesamt)355 in CSV format:
• Destatis, table 12613-0006356
• Destatis, table 12411-0001357
• Destatis, Sonderreihe mit Beiträgen für das Gebiet der ehemaligen DDR, Heft 3358
• Destatis, table 12411-0020359
For each file write a short note on its content.
Solution:
# your notes
Task: Use your favorite spreadsheet tool to compile the following CSV files from the downloaded files:
• inhabitants-yearly.csv with columns year, FRG (inhabitants FRG), GDR (inhabitants GDR, 0
from 1990 on)
• inhabitants-quarterly.csv with columns date, inhabitants
• deaths-monthly.csv with columns year, month (numeric 1…12), men, women
We want to use dates as index for data frames. Numbers of inhabitants are related to precise timestamps (end of year
or quarter). Numbers of deaths are related to periods (month).
Task: Read in the three CSV files. Use DatetimeIndex and PeriodIndex for data frames and series. In the
end you should have two series:
• inhabitants with index date (timestamp of last day in year or quarter),
• deaths with index date (monthly period aligned at last day of month).
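A sketch for the deaths series with inline stand-in data (the real numbers come from the CSV files of the previous task):

```python
import io
import pandas as pd

# stand-in for deaths-monthly.csv
csv = io.StringIO("year,month,men,women\n2020,1,40000,42000\n2020,2,38000,40000\n")
df = pd.read_csv(csv)

# monthly periods as index, total deaths as values
dates = pd.to_datetime(df[["year", "month"]].assign(day=1))
deaths = pd.Series((df["men"] + df["women"]).to_numpy(),
                   index=pd.PeriodIndex(dates.dt.to_period("M")), name="deaths")
print(deaths.loc["2020-01"])  # → 82000
```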
355 https://www.destatis.de
356 https://www-genesis.destatis.de/genesis//online?operation=table&code=12613-0006
357 https://www-genesis.destatis.de/genesis//online?operation=table&code=12411-0001
358 https://www.statistischebibliothek.de/mir/servlets/MCRFileNodeServlet/DEMonografie_derivate_00000961/Heft_3.pdf
359 https://www-genesis.destatis.de/genesis//online?operation=table&code=12411-0020
Data Science and Artificial Intelligence for Undergraduates
Solution:
# your solution
For calculating monthly death rates we need the number of inhabitants on a monthly basis, that is, the mean number of inhabitants per month. If we had daily values for the number of inhabitants, we could simply calculate the mean. But the resolution is much coarser. Thus, we have to use (linear) interpolation. A good replacement for the monthly mean is the (interpolated) value at the 15th of the month.
Task: Use resampling to get interpolated number of inhabitants at the 15th of each month. From these values
construct a series with period index (in analogy to the deaths series’ index). Hint: instead of (integer) index based
linear interpolation you may want to use timestamp based interpolation (see docs).
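A sketch with two made-up year-end values (the real series spans many years):

```python
import pandas as pd

# year-end inhabitant counts (millions, made up)
inhabitants = pd.Series([83.0, 83.2],
                        index=pd.to_datetime(["2020-12-31", "2021-12-31"]))

# upsample to daily resolution and interpolate along the time axis
daily = inhabitants.resample("D").mean().interpolate(method="time")

# value at the 15th of each month, re-indexed by monthly periods
mid = daily[daily.index.day == 15]
monthly = pd.Series(mid.to_numpy(), index=mid.index.to_period("M"))
print(len(monthly))  # → 12
```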
Solution:
# your solution
Task: Calculate monthly death rates and plot results with Series.plot().
Solution:
# your solution
Mathematics
CHAPTER
TWENTYNINE
LOGIC
29.1.1 Not
The not operator inverts its operand’s truth value.
𝑎 not 𝑎
true false
false true
29.1.2 And
The and operator yields true if and only if both operands are true.
𝑎 𝑏 𝑎 and 𝑏
true true true
true false false
false true false
false false false
29.1.3 Or (inclusive)
The or operator yields true if and only if at least one of both operands is true. This is sometimes called inclusive or
because the and-case (both operands true) is included (that is, yields true).
𝑎 𝑏 𝑎 or 𝑏
true true true
true false true
false true true
false false false
29.1.4 Or (exclusive)
The xor operator yields true if and only if exactly one of both operands is true. This is called exclusive or because the and-case is excluded.
𝑎 𝑏 𝑎 xor 𝑏
true true false
true false true
false true true
false false false
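These tables correspond directly to Python’s Boolean operators (on bool values the ^ operator acts as exclusive or):

```python
# and / or / xor on Python booleans, matching the truth tables above
print(True and False)   # → False
print(True or False)    # → True
print(True ^ True)      # → False  (exclusive or: and-case excluded)
print(True ^ False)     # → True
```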
THIRTY
COMBINATORICS
Combinatorics is the mathematical field of counting. Applications include discrete distributions in elementary probability theory.
30.1 Factorial
For 𝑛 ∈ ℕ the factorial 𝑛! is defined as

$$n! := 1 \cdot 2 \cdot \,\cdots\, \cdot (n-1) \cdot n,$$

where 0! ∶= 1.
Obviously, 𝑛! = (𝑛 − 1)! 𝑛.
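In Python the factorial is available in the standard library:

```python
import math

print(math.factorial(5))   # → 120
print(math.factorial(0))   # → 1

# the recurrence n! = (n-1)! * n
assert math.factorial(6) == math.factorial(5) * 6
```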
THIRTYONE
LINEAR ALGEBRA
This chapter summarizes tools and results from linear algebra used throughout the book.
• Vectors (page 333)
• Matrices (page 334)
• Systems of Linear Equations (page 336)
31.1 Vectors
The term vector is used in several different contexts, each coming with a slightly different definition. In basic linear
algebra a vector often is considered a finite column of numbers. That’s the approach we follow here.
31.1.1 Definition
For 𝑑 ∈ ℕ a 𝑑-tuple is an ordered list of 𝑑 real numbers, typically written as (𝑥1 , 𝑥2 , … , 𝑥𝑑 ) with 𝑥1 , … , 𝑥𝑑 denoting
the numbers. Here ‘ordered’ means that swapping two unequal numbers in the list yields a different 𝑑-tuple. Example:
(1, 2, 3) ≠ (1, 3, 2). By ℝ𝑑 we denote the set of 𝑑-tuples.
Vector is another term for 𝑑-tuple. In linear algebra vectors may be interpreted as points in space or as the difference between two points (that is, describing a translation). Vectors often are written as columns:
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}.$$
Sums of vectors are defined componentwise. For 𝑥, 𝑦 ∈ ℝ𝑑 we have

$$x + y := \begin{bmatrix} x_1 + y_1 \\ \vdots \\ x_d + y_d \end{bmatrix}.$$
Products of real numbers and vectors are defined componentwise. For 𝑎 ∈ ℝ and 𝑥 ∈ ℝ𝑑 we have
$$a\,x := \begin{bmatrix} a\,x_1 \\ \vdots \\ a\,x_d \end{bmatrix}.$$
The inner product of two vectors 𝑥, 𝑦 ∈ ℝ𝑑 is defined as

$$\langle x, y \rangle := x_1\,y_1 + \cdots + x_d\,y_d.$$
For 𝑑 = 3 the outer product (cross product) of 𝑥 and 𝑦 is defined as

$$x \times y := \begin{bmatrix} x_2\,y_3 - x_3\,y_2 \\ x_3\,y_1 - x_1\,y_3 \\ x_1\,y_2 - x_2\,y_1 \end{bmatrix}.$$
The outer product yields a vector orthogonal to both factors.
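In NumPy these operations read as follows (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(x + y)           # componentwise sum → [5. 7. 9.]
print(2 * x)           # scalar multiple  → [2. 4. 6.]
print(np.dot(x, y))    # inner product    → 32.0
cp = np.cross(x, y)    # outer (cross) product
print(cp)              # → [-3.  6. -3.]

# the cross product is orthogonal to both factors
assert np.dot(cp, x) == 0.0 and np.dot(cp, y) == 0.0
```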
31.2 Matrices
A matrix is a rectangular scheme of numbers. Matrices frequently appear in almost all fields of mathematics because they can be used to represent abstract concepts like linear mappings numerically. Many abstract operations in mathematics boil down to matrix computations as soon as concrete numerical examples are considered.
31.2.1 Definition
For 𝑚 ∈ ℕ and 𝑛 ∈ ℕ a matrix is an 𝑚-tuple of 𝑛-tuples (see Vectors (page 333) for definition of tuples). Example:
$$\begin{pmatrix} 1 & 4 & 5 & -2 \\ 3 & 2 & 7 & 10 \\ -2 & 4 & 5 & 8 \end{pmatrix} \quad\text{or}\quad \begin{bmatrix} 1 & 4 & 5 & -2 \\ 3 & 2 & 7 & 10 \\ -2 & 4 & 5 & 8 \end{bmatrix}$$
The set of all real-valued matrices with 𝑚 rows and 𝑛 columns is denoted as ℝ𝑚×𝑛 .
Matrices usually are denoted by uppercase letters and a matrix’s elements by corresponding lowercase letters with a double index. The first index denotes the row, the second the column of the element. Example:

$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,n} \\ \vdots & & \vdots \\ a_{m,1} & \cdots & a_{m,n} \end{bmatrix}.$$

Row 𝑖 is denoted by 𝑎𝑖,• , column 𝑗 by 𝑎•,𝑗 . Rows and columns can be regarded as vectors.
A matrix with identical number of rows and columns (𝑚 = 𝑛) is called square matrix. The tuple (𝑎11 , 𝑎22 , … , 𝑎𝑚𝑚 )
is called main diagonal of the matrix 𝐴.
A square matrix of the form
$$\begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \ddots & & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & & \ddots & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}$$

is called the identity matrix.
Matrices of the form
$$\begin{pmatrix} * & \cdots & \cdots & * \\ 0 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & * \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} * & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ \vdots & & \ddots & 0 \\ * & \cdots & \cdots & * \end{pmatrix}$$
are called upper-triangular and lower-triangular.
31.2.3 Transpose

The transpose of a matrix 𝐴 ∈ ℝ𝑚×𝑛 is the matrix 𝐴T ∈ ℝ𝑛×𝑚 with elements

$$(A^\mathrm{T})_{j,i} := a_{i,j},$$

that is, the same matrix as 𝐴, but with rows and columns interchanged.

Transposing twice yields the original matrix:

$$(A^\mathrm{T})^\mathrm{T} = A.$$
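In NumPy the transpose is available via the T attribute:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

print(A.T.shape)            # → (3, 2)
assert (A.T.T == A).all()   # double transpose gives A back
```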
The product of two matrices 𝐴 ∈ ℝ𝑚×𝑛 and 𝐵 ∈ ℝ𝑛×𝑝 is the matrix 𝐶 ∶= 𝐴 𝐵 ∈ ℝ𝑚×𝑝 with entries

$$c_{ik} := \sum_{j=1}^{n} a_{ij}\,b_{jk}.$$
A square matrix 𝐴 ∈ ℝ𝑛×𝑛 is called invertible if there is a square matrix 𝐵 ∈ ℝ𝑛×𝑛 such that

$$A\,B = I \quad\text{and}\quad B\,A = I,$$

where 𝐼 is the identity matrix. The matrix 𝐵 then is called the inverse of 𝐴.
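Matrix product and inverse in NumPy (example matrices are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[2.0, 0.0],
              [1.0, 2.0]])

C = A @ B                      # matrix product
print(C)                       # → [[ 4.  4.]  [10.  8.]]

Ainv = np.linalg.inv(A)        # inverse of A
assert np.allclose(A @ Ainv, np.eye(2))
assert np.allclose(Ainv @ A, np.eye(2))
```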
31.2.6 Determinants
The determinant det 𝐴 of a square matrix 𝐴 ∈ ℝ𝑛×𝑛 is the real number computed from the following iterative rule:

$$\det A := \begin{cases} a_{1,1}, & \text{if } n = 1, \\ \displaystyle\sum_{j=1}^{n} (-1)^{1+j}\, a_{1,j} \det A_{1,j}, & \text{if } n > 1, \end{cases}$$
where 𝐴1,𝑗 is the submatrix of 𝐴 originating from removing row 1 and column 𝑗.
The determinant is non-zero if and only if the matrix is invertible. It is positive if the matrix columns constitute a
right-handed coordinate system and negative if columns constitute a left-handed coordinate system.
For each fixed 𝑖 ∈ {1, … , 𝑛} we get an equivalent iterative definition:

$$\det A = \sum_{j=1}^{n} (-1)^{i+j}\, a_{i,j} \det A_{i,j} \quad\text{for } n > 1.$$
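For a 2×2 matrix the expansion reduces to the familiar formula, which NumPy confirms:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# 2x2 case of the cofactor expansion: a11*a22 - a12*a21
assert np.isclose(np.linalg.det(A), 1.0 * 4.0 - 2.0 * 3.0)

# the determinant is non-zero, so A is invertible
assert np.linalg.det(A) != 0
```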
31.3 Systems of Linear Equations

A system of linear equations with 𝑚 equations and 𝑛 unknowns 𝑥1 , … , 𝑥𝑛 has the form

$$\begin{aligned} a_{1,1}\,x_1 + \cdots + a_{1,n}\,x_n &= b_1 \\ &\,\,\vdots \\ a_{m,1}\,x_1 + \cdots + a_{m,n}\,x_n &= b_m. \end{aligned}$$

With the coefficient matrix 𝐴 ∈ ℝ𝑚×𝑛 , the vector of unknowns 𝑥 ∈ ℝ𝑛 and the right-hand side 𝑏 ∈ ℝ𝑚 it can be written compactly as

$$A\,x = b.$$
Depending on the coefficient matrix 𝐴 and the right-hand side 𝑏 a system of linear equations has either no solution, exactly one solution, or infinitely many solutions.
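For the uniquely solvable case NumPy solves such systems directly (example system is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 4.0])

x = np.linalg.solve(A, b)  # unique solution, since det(A) != 0
assert np.allclose(A @ x, b)
print(x)  # → [1. 1.]
```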
Software Development
CHAPTER
THIRTYTWO
There’s a standard for visualizing the design of software and other systems: the Unified Modeling Language (UML). Among many other types of visualization it standardizes how to express relations between classes graphically. In particular, inheritance relations can be visualized. We do not go into the details here. But you should know that the standard exists, and from time to time you should practice reading UML class diagrams, since they are used for planning and communicating larger software projects.
To get an overview of UML class diagrams have a look at Class diagram360 at Wikipedia.
An open source tool for drawing UML class diagrams is UMLet361 .
Other types of diagrams are shown in Wikipedia’s article Unified Modeling Language362 .
360 https://en.wikipedia.org/wiki/Class_diagram
361 https://www.umlet.com
362 https://en.wikipedia.org/wiki/Unified_Modeling_Language