ENGINEERING PRACTICES
for Data Scientists
Contents
Foreword
Git
Python dependencies
Docker
Final takeaways
Juha Kiili
Senior Software Developer, Product Owner at Valohai
Senior Software Developer with a gaming industry background who shape-shifted into a full-stack ninja. I have the biggest monitor.
Foreword
Software engineering has come a long way. It’s no longer just about getting a
functioning piece of code on a floppy disk; it’s about the craft of making software.
There’s a good reason for it too. Code lives for a long time.
Thus there are a lot of strong opinions about good engineering practices that
make developing software for the long haul possible and more enjoyable. I think
enjoyability is an important word here because most software developers know
the pain of fixing poorly developed and poorly documented legacy software.
Data scientists are also entering this world because machine learning is becoming
a core part of many products. While a heterogeneous bunch with various
backgrounds, data scientists are more commonly from academia and research
than software engineering. The slog of building and maintaining software isn’t
as familiar as it is to most developers, but it will be soon enough. It’s better to be
prepared with a solid foundation of best practices, so it’ll be easier to work with
software engineers, and it’ll be easier to maintain what you build.
This eBook is here to help you pick up engineering best practices with simple tips. I hope
that we can teach even the most seasoned pros something new and get you talking
with your team on how you should be building things. Remember, as machine
learning becomes a part of software products, it too will live for a long time.
This eBook isn’t about Valohai – although there is a section about our MLOps
platform at the end – but good engineering is close to our heart.
Git
What is Git?
Git is the underlying technology, and its command-line client (CLI), for tracking and merging changes in source code.
Usually, there is a single central repository (called "origin" or "remote"), which individual users clone to their local machine (called "local" or "clone"). Once the users have saved meaningful work (called "commits"), they send it back ("push" and "merge") to the central repository.
GitHub is a web platform built on top of the git technology to make it easier to use. It also offers additional features like user management, pull requests, and automation. Alternatives include, for example, GitLab and Bitbucket.
Terminology
• Repository – "Database" of all the branches and commits of a single project.
• Branch – An alternative state or line of development for a repository.
• Merge – Merging two (or more) branches back into a single branch, the single truth.
• Clone – Creating a local copy of the remote repository.
• Origin – Common alias for the remote repository from which the local clone was created.
• Main / Master – Common name for the root branch, which is the central source of truth.
• Stage – Choosing which files will be part of the new commit.
• Commit – A saved snapshot of staged changes made to the file(s) in the repository.
• HEAD – Shorthand for the commit your local repository is currently on.
• Push – Sending your commits to the remote repository for everyone to see.
• Pull – Getting everybody else's commits from the remote repository into your local one.
• Pull Request – A mechanism to review and approve your changes before merging them to main/master.
Basic commands
• git push – Send your saved snapshots (commits) to the remote repository.
• git pull – Pull recent commits made by others into your local repository.
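Putting the pieces together, a typical day-to-day loop looks roughly like this (the repository URL, branch name, and file name are illustrative):

# get a local copy of the remote repository
git clone https://github.com/example/myproject.git
cd myproject

# create a branch for your work
git checkout -b my-feature

# stage and commit your changes
git add train.py
git commit -m "Add training script"

# publish your branch to the remote repository
git push -u origin my-feature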
Rules of thumb for Git
There are extensions like LFS that refer to external datasets from a git repository.
While they serve a purpose and solve some of the technical limits (size, speed),
they do not solve the core problem of a code-centric software development
mindset rooted in git.
You will always have datasets floating around in your local directory though. It
is quite easy to accidentally stage and commit them if you are not careful. The
correct way to make sure that you don't need to worry about datasets with git is
to use the .gitignore config file. Add your datasets or data folder into the
config and never look back.
Example:
# ignore archives
*.zip
*.tar
*.tar.gz
*.rar
# ignore dataset folder and subfolders
datasets/
Don't push secrets
This should be obvious, yet constant real-world mistakes prove to us that it is not. It doesn't matter if the repository is private, either. Under no circumstances should anyone commit any username, password, API token, key code, TLS certificate, or any other sensitive data into git.
Even private repositories are accessible by multiple accounts and are also cloned
to multiple local machines. This gives the hypothetical attacker exponentially more
targets. Remember that private repositories can also become public at some point.
Decouple your secrets from your code and pass them through the environment instead. For Python, you can use the common .env file, which holds the environment variables, together with a .gitignore entry that makes sure the .env file itself doesn't get pushed to the remote git repository. It is a good idea to also provide a .env.template so others know what kind of environment variables the system expects.
.env:
API_TOKEN=98789fsda789a89sdafsa9f87sda98f7sda89f7
.env.template:
API_TOKEN=
.gitignore:
.env
hello.py:
import os
from dotenv import load_dotenv

load_dotenv()
api_token = os.getenv('API_TOKEN')
This still requires some manual copy-pasting for anyone cloning the repository for the first time. For a more advanced setup, there are encrypted, access-restricted tools that can share secrets through the environment, such as Vault.
Note: If you already pushed your secrets to the remote repository, do not try to
fix the situation by simply deleting them. It is too late as git is designed to be
immutable. Once the cat is out of the bag, the only valid strategy is to change the
passwords or disable the tokens.
Don't push notebook outputs
Notebooks are cool because they let you store not only code but also the cell outputs like images, plots, and tables. The problem arises when you commit and push the notebook with its outputs to git.
Git thinks that the JSON gibberish is equally important as your code. The three
lines of code that you changed are mixed with three thousand lines that were
changed in the JSON gibberish. Trying to compare the two versions becomes
useless due to all the extra noise.
It becomes even more confusing if you have changed some code after the outputs
were generated. Now the code and outputs that are stored in the version control
do not match anymore.
You can manually clear the outputs from the main menu (Cells -> All Output ->
Clear) before creating your git commit.
You can set up a pre-commit hook for git that clears outputs automatically.
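One way to automate this is the nbstripout package, which registers a git filter that strips notebook outputs whenever they are committed (a minimal sketch):

pip install nbstripout
nbstripout --install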
Don't use the --force
Sometimes when you try to push to the remote
repository, git tells you that something is
wrong and aborts. The error message might
offer you an option to "use the force" (the
-f or --force ). Don't do it! Even if the
error message calls for your inner Jedi, just
don't. It's the dark side.
Obviously, there are reasons why the --force option exists, and it serves a purpose in some situations. None of those arguments apply to you, young padawan.
Whatever the case, read the error message, try to reason what could be the issue,
ask someone else to help you if needed, and get the underlying issue fixed.
Clean up your local history before pushing
In real life you often make all kinds of ad-hoc commits and end up with a messy local history. If you haven't pushed anything to the public remote yet, you can still fix the situation. We recommend learning how to use the interactive rebase.
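A minimal example (the commit range HEAD~3 is illustrative; pick whichever range covers the commits you want to clean up):

git rebase -i HEAD~3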
The interactive mode offers many different options for tweaking the history, rewording commit messages, and even changing the order of commits. Learn more from the official git documentation on interactive rebase.
The idea with branches is to eventually merge back to the main branch and
update "the central truth". This is where pull requests come into play. The rest of
the world doesn't care about your commits in your own branch, but merging to
main is where your branch becomes the latest truth. That is when it's time to
make a pull request.
Pull requests are not a git concept, but a GitHub concept. They are a request for
making your branch the new central truth. Using the pull request, other users will
check your changes before they are allowed to become the new central truth.
GitHub offers great tools to make comments, suggest their modifications, signal
approval, and finally apply the merge automatically.
Python dependencies
Dependency management is the act of managing all the external pieces that your
project relies on. It has the risk profile of a sewage system. When it works, you
don’t even know it’s there, but it becomes excruciating and almost impossible to
ignore when it fails.
You could write installation instructions on a piece of paper. You could write them
in your source code comments. You could even hardcode the install commands
straight into the program. Dependency management? Yes. Recommended? Nope.
The recommended way is to decouple the dependency information from the code
in a standardized, reproducible, widely-accepted format. This allows version
pinning and easy deterministic installation. There are many options, but we’ll
describe the classic combination of pip and a requirements.txt file in this chapter.
But before we go there, let's first introduce the atomic unit of Python dependency:
the package.
What is a package?
You are probably working with packages every day by referring to them in your
code with the Python import statement.
While you could install packages by simply downloading them manually into your project, the most common way to install a package is from PyPI (the Python Package Index) using the famous pip install command.
Note: Never use sudo pip install . Never. It is like running a virus. The
results are unpredictable and will cause your future self major pain.
Never install Python packages globally either. Always use virtual environments.
Python virtual environment is a safe bubble. You should create a protective bubble
around all the projects on your local computer. If you don't, the projects will hurt
each other. Don't let the sewage system leak!
Go to your project root directory and create a virtual environment:
python3 -m venv mybubble
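Then activate it to step inside the bubble (this is the Linux/Mac form; on Windows the script is mybubble\Scripts\activate instead):

source mybubble/bin/activate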
Now we are in the bubble. Your terminal should show the virtual environment
name in parenthesis like this:
(mybubble) johndoe@hello:~/myproject$
Now that we are in the bubble, installing packages is safe. From now on, any pip
install command will only have effects inside the virtual environment. Any code
you run will only use the packages inside the bubble.
If you list the installed packages you should see a very short list of currently
installed default packages (like the pip itself).
pip list
Package Version
------------- -------
pip 20.0.2
pkg-resources 0.0.0
setuptools 44.0.0
This listing no longer shows all the Python packages on your machine, but only the Python packages inside your virtual environment. Also, note that the Python version used inside the bubble is the Python version you used to create the bubble.
Always create virtual environments for all your local projects and run your code
inside those bubble(s). The pain from conflicting package versions between
projects is the kind of pain that makes people quit their jobs. Don't be one of
those people.
Imagine you have a project that depends on the Pandas package and you want to communicate that to the rest of the world (and your future self). Should be easy, right?
First of all, it is risky to just say: "You need Pandas". Even pinning the exact version with pandas==1.2.1 isn't enough, because Pandas itself depends on other packages, such as NumPy, which are left unpinned. At first, everything is fine, but after six months, a new numpy version 1.19.6 is released with a showstopper bug.
Now if someone installs your project, they'll get pandas 1.2.1 with the buggy numpy 1.19.6, and probably a few gray hairs as your software spits weird errors. The sewage system is leaking. The installation process was not deterministic!
The most reliable way is to pin everything. Pin the dependencies of the
dependencies of the dependencies of the dependencies, of the… You'll get the
point. Pin'em as deep as the rabbit hole goes. Luckily there are tools that make
this happen for you.
Note: If you are building a reusable package and not a typical project, you should
not pin it so aggressively (this is why Pandas doesn't pin to the exact Numpy
version). It is considered best practice for the end-user of the package to decide
what and how aggressively pin. If you as a package creator pin everything, then
you close that door from the end-user.
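For example, a library author would typically declare a compatible range instead of an exact version (an illustrative constraint, not taken from any real package):

# a library's dependency constraint
pandas>=1.2,<2.0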
Whenever you call pip install to get some hot new package into your
project, you should stop and think for a second. This will create a new dependency
for your project. How do I document this?
You should write down new libraries and their version numbers in a requirements.txt file. It is a format understood by pip for installing multiple packages in one go.
# requirements.txt
pandas==1.2.1
matplotlib==3.4.0
# Install
pip install -r requirements.txt
This is already much better than most data science projects that one encounters,
but we can still do better. Remember the recursive dependency rabbit hole from
the previous chapter about version pinning. How do we make the installation
more deterministic?
# requirements.in
matplotlib==3.4.0
pandas==1.2.1
# Auto-generate requirements.txt
pip-compile requirements.in
# Generated requirements.txt
cycler==0.11.0
# via matplotlib
kiwisolver==1.3.2
# via matplotlib
matplotlib==3.4.0
# via -r requirements.in
numpy==1.22.0
# via
# matplotlib
# pandas
pandas==1.2.1
# via -r requirements.in
pillow==9.0.0
# via matplotlib
pyparsing==3.0.6
# via matplotlib
python-dateutil==2.8.2
# via
# matplotlib
# pandas
pytz==2021.3
# via pandas
six==1.16.0
# via python-dateutil
The pip-compile command (from the pip-tools package) will then generate the perfect pinning of all the libraries into the requirements.txt, which provides all the information for a deterministic installation. Easy peasy! Remember to commit both files into your git repository, too.
Pinning the Python version is tricky. There is no straightforward way to pin the version dependency for Python itself (without using e.g. conda).
The most bullet-proof way to force the Python version is to use Docker containers,
which we will talk about in the next chapter!
Main takeaways
Don't avoid dependency management – Your future self will appreciate the
documented dependencies when you pour coffee all over your MacBook.
Always use virtual environments on your local computer – Trying out that
esoteric Python library with 2 GitHub stars is no big deal when you are safely
inside the protective bubble.
Pinning versions is better than not pinning – Version pinning protects from
packages moving forward when your project is not.
Packages change a lot, Python not so much – Even a single package can have
dozens of nested dependencies and they are constantly changing, but Python is
relatively stable and future-proof.
When your project matures enough and elevates into the cloud and into production,
you should look into pinning the entire environment and not just the Python stuff.
This is where Docker containers are your best friend as they not only let you pin
the Python version but anything inside the operating system. It is like a virtual
environment but on a bigger scale.
Docker
What is Docker?
Any software faces the same problem as an astronaut: as soon as we leave home and go out into the world, the environment gets hostile, and a protective mechanism to reproduce our natural environment is mandatory. The Docker container is the spacesuit of programs.
Docker isolates the software from all other things on the same system. A program
running inside a “spacesuit” generally has no idea it is wearing one and is unaffected
by anything happening outside.
The containerized stack
At the bottom of the containerized stack sits the operating system, which provides the low-level interfaces and drivers to interact with the hardware.
The fundamental idea is to package an application and its dependencies into a single
reusable artifact, which can be instantiated reliably in different environments.
Dockerfile
We could define the temperature, radiation, and oxygen levels for a spacesuit,
but we need instructions, not requirements. Docker is instruction-based, not
requirement-based. We will describe the how and not the what. To do that, we
create a text file and name it Dockerfile .
# Dockerfile
FROM python:3.9
RUN pip install tensorflow==2.7.0
RUN pip install pandas==1.3.3

Instead of listing the packages one by one, it is better to reuse the requirements.txt file from the previous chapter:

# Dockerfile
FROM python:3.9
COPY requirements.txt /tmp
RUN pip install -r /tmp/requirements.txt
The COPY command copies a file from your local disk, like the
requirements.txt , into the image. The RUN command here installs all
the Python dependencies defined in the requirements.txt in one go.
Note: All the familiar Linux commands are at your disposal when using RUN.
Docker image
Now that we have our Dockerfile , we can compile it into a binary artifact
called an image.
The reason for this step is to make it faster and reproducible. If we didn't compile it, everyone needing a spacesuit would need to find a sewing machine and painstakingly run all the instructions for every spacewalk. That is not only too slow but also nondeterministic: your sewing machine might be different from mine. The tradeoff for speed and quality is that images can be quite large, often gigabytes, but a gigabyte in 2022 is peanuts anyway.
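To build the image from the Dockerfile in the current directory (the image name and tag are illustrative, chosen to match the run commands below):

docker build . -t myimagename:1.0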
This builds an image stored on your local machine. The -t parameter defines the image name as "myimagename" and gives it a tag "1.0". To list all the images, run:
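For example (docker images is an equivalent, older form of the same command):

docker image ls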
Docker container
Finally, we are ready for our spacewalk. Containers are the real-life instances of
a spacesuit. They are not really helpful in the wardrobe, so the astronaut should
perform a task or two while wearing them.
The instructions can be baked into the image or provided just in time before
starting the container. Let’s do the latter.
docker run myimagename:1.0 echo "Hello world"
This starts the container, runs a single echo command, and closes it down.
Now we have a reproducible method to execute our code in any environment that
supports Docker. This is very important in data science, where each project has
many dependencies, and reproducibility is at the heart of the process.
Containers close down automatically when they have executed their instructions,
but containers can run for a long time. Try starting a very long command in the
background (using your shell’s & operator):
docker run myimagename:1.0 sleep 100000000000 &
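To see it running and find its container ID, list the active containers:

docker ps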
To stop this container, take the container ID from the table and call:
docker stop <CONTAINER ID>
This stops the container, but its state is kept around. If you call

docker ps -a

you can see that the container is stopped but still exists. To completely destroy it:
docker rm <CONTAINER ID>
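You can also start a container with an interactive shell inside it (a sketch; /bin/bash is available in Debian-based images like the python:3.9 base used here):

docker run -it myimagename:1.0 /bin/bash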
This is great for debugging the inner workings of an image, as you can freely run all the Linux commands interactively. Go back to your host shell by running the exit command.
Registry = Service for hosting and distributing images. The default registry is the
Docker Hub.
Repository = Collection of related images with the same name but different tags.
Usually, different versions of the same application or service.
Tag = An identifier attached to images within a repository (e.g., 14.04 or stable )
ImageID = Unique identifier hash generated for each image
The full name of an image can encode the registry hostname and a bunch of slash-separated "name components". Honestly, this is quite convoluted, but such is life.
It may vary per platform. For Google Cloud Platform (GCP), the convention is:

<registry>/<project-id>/<repository-name>/<image>@<image-digest>:<tag>
It is up to you to figure out the correct naming scheme for your case.
Docker images and secrets
Just like it is a terrible practice to push secrets into a git repository, you shouldn’t
bake them into your Docker images either!
Images are put into repositories and passed around carelessly. The correct
assumption is that whatever goes into an image may be public at some point. It is
not a place for your username, password, API token, key code, TLS certificates,
or any other sensitive data.
There are two kinds of secrets: those needed at build time and those needed at runtime. Neither case should be solved by baking things permanently into the image. Let's look at how to do it differently.
Build-time secrets
Quick googling will give you many different options to solve this problem, like
using multi-stage builds, but the best and most modern way is to use BuildKit.
BuildKit ships with Docker but needs to be enabled for builds by setting up the
environment variable DOCKER_BUILDKIT .
For example:
DOCKER_BUILDKIT=1 docker build .
BuildKit offers a mechanism to make secret files safely available for the build
process.
Let’s first create secret.txt with the contents:
TOP SECRET ASTRONAUT PASSWORD
Then, in the Dockerfile, the secret can be mounted for a single RUN instruction:

RUN --mount=type=secret,id=mypass cat /run/secrets/mypass
Let’s build the image, adding `--secret` to inform `docker build` about where to
find this secret:
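A sketch of the build command (the image name is illustrative; depending on your Docker version, you may also need a # syntax=docker/dockerfile:1.2 line at the top of the Dockerfile):

DOCKER_BUILDKIT=1 docker build -t myimagename:1.0 --secret id=mypass,src=secret.txt .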
Everything worked, but we didn't see the contents of secret.txt printed out in our terminal as we expected. The reason is that BuildKit doesn't print the full output of every build step by default; you can re-run the build with the --progress=plain flag to see it.
Among all the logs printed out, you should find this part:
#5 [2/2] RUN --mount=type=secret,id=mypass cat /run/secrets/mypass
#5 sha256:7fd248d616c172325af799b6570d2522d3923638ca41181fab438c29d0aea143
#5 0.248 TOP SECRET ASTRONAUT PASSWORD
It is proof that the build step had access to secret.txt .
With this approach, you can now safely mount secrets to the build process without
worrying about leaking keys or passwords to the resulting image.
Runtime secrets
If you need a secret – say database credentials – when your container is running in
production, you should use environment variables to pass secrets into the container.
Never bake any secrets straight into the image at build time!
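A minimal sketch of passing a secret through the environment at runtime (the variable name and image name are illustrative):

docker run -e API_TOKEN=$API_TOKEN myimagename:1.0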
Tip: You can also fetch the secrets from a secret store like Hashicorp Vault!
GPU support
Docker with GPUs can be tricky. Building an image from scratch is beyond the
scope of this chapter, but there are five prerequisites for a modern GPU (NVIDIA)
container.
Image:
• CUDA/cuDNN libraries
• GPU versions of your framework like Tensorflow (when needed)
Host machine:
• GPU drivers
• NVIDIA Container Toolkit
• docker run executed with --gpus all
The best approach is finding a base image with most prerequisites already baked in. Frameworks like Tensorflow usually offer images like tensorflow/tensorflow:latest-gpu, which are a good starting point.
When troubleshooting, you can first test your host machine:
nvidia-smi
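Then run the same check from inside a container (a sketch; the CUDA base image tag is illustrative):

docker run --gpus all --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi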
If you get an error from either, you’ll have an idea whether the problem lies inside
or outside the container.
It’s also a good idea to test your frameworks. For example Tensorflow:
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
  python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
The output may be verbose and have some warnings, but it should end with
something like:
Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3006 MB memory: -> device: 0, name: NVIDIA GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2
tf.Tensor(-237.35098, shape=(), dtype=float32)
Docker containers vs. Python virtual environments
For local development, virtual environments are like wearing sunscreen on the beach, while Docker containers are like wearing a spacesuit: usually uncomfortable and mostly impractical.
What every data scientist should know about
the command line
This chapter focuses on the UNIX-style (Linux & Mac) command line and ignores the rest (like Windows's command processor and PowerShell) for clarity. We have observed that most data scientists are on UNIX-based systems these days.
What is it?
The command line is a text-based interface to your computer. You can think of it as "popping the hood" of an operating system. Some people mistake it for a relic of the past, but don't be fooled. The modern command line is rocking like never before!
Back in the day, text-based input and output were all you got (after punch cards,
that is). Like the very first cars, the first operating systems didn't even have a hood
to pop. Everything was in plain sight. In this environment, the so-called REPL (read-eval-print loop) methodology was the natural way to interact with a computer.
REPL means that you type in a command, press enter, and the command is evaluated immediately. It is different from the edit-run-debug or edit-compile-run-debug loops, which you commonly use for more complicated programs.
The command line generally follows the UNIX philosophy of "Make each program
do one thing well", so basic commands are very straightforward. The fundamental
premise is that you can do complex things by combining these simple programs.
The old UNIX neckbeards refer to "having a conversation with the computer."
Almost any programming language in the world is more powerful than the
command line, and most point-and-click GUIs are simpler to learn. Why would
you even bother doing anything on the command line?
The first reason is speed. Everything is at your fingertips. For telling the computer
to do simple tasks like downloading a file, renaming a bunch of folders with a
specific prefix, or performing a SQL query on a CSV file, you really can't beat the
agility of the command line. The learning curve is there, but it is like magic once
you have internalized a basic set of commands.
The second reason is automation. Unlike in GUI interfaces, everything done in the command line can eventually be automated. There is zero ambiguity between the instructions and the computer. All those repeated clicks in GUI-based tools that you waste your life on can be automated in a command-line environment.
The third reason is extensibility. Unlike GUIs, the command line is very modular. The simple commands are perfect building blocks for creating complex functionality for myriad use-cases, and the ecosystem is still growing after 50 years. The command line is here to stay.
The fourth reason is that sometimes there is no other option. It is common that some of the more obscure or bleeding-edge features of a third-party service are not accessible via a GUI at all and can only be used through a CLI (Command Line Interface).
Terminal = The application that grabs your keyboard input, passes it to the program being run (e.g. the shell), and renders the results back. As all modern computers have graphical user interfaces (GUI) these days, the terminal is a necessary GUI frontend layer between you and the rest of the text-based stack.
Shell = A program that parses the keystrokes passed by the terminal application
and handles running commands and programs. Its job is basically to find where the
programs are, take care of things like variables, and also provide fancy completion
with the TAB key. There are different options like Bash, Dash, Zsh, and Fish, to
name a few. All with slightly different sets of built-in commands and options.
Operating system = The program that executes all other programs. It handles the
direct interaction with all the hardware like the CPU, hard disk, and network.
And what's up with that tilde (~) character? What does it even mean that the
current directory is ~/hello ?
Tilde is shorthand for the home directory, a place for all your personal files.
My home directory is /home/juha , so my current working directory is
/home/juha/hello , which shorthands to ~/hello . (The convention
~username refers to someone's home directory in general; ~juha refers to my
home directory and so on.)
From now on, we will omit everything else except the dollar sign from the prompt
to keep our examples cleaner.
When you type something after the prompt and press enter, the shell program will
attempt to parse and execute it. Let's say:
$ generate million dollars
generate: command not found
The shell program takes the first complete word generate and considers that
a command.
The two remaining words, million and dollars , are interpreted as two
separate parameters (sometimes called arguments).
Now the shell program, whose responsibility is to facilitate the execution, goes
looking for a generate command. Sometimes it is a file on a disk and sometimes
something else. We'll discuss this in detail in our next chapter.
$ df --human-readable
proc 0 0 0 - /proc
udev 16G 0 16G 0% /dev
. . .
Here we run the command df (short for "disk free") with the --human-readable option.
It is common to use "-" (dash) in front of the abbreviated option and "--" (double-dash) for the long form. (These conventions have evolved over time.)
$ df -h
$ df --human-readable
You can generally also merge multiple abbreviated options after a single dash.
df -h -l -a
df -hla
If you want to know all the available options, you can usually get a listing with the
--help parameter:
df --help
Tip: One of the most common things to type into the command line is a long file path. Most shells offer TAB-key auto-completion for paths and commands to avoid repetitive typing. Try it out!
Builtins, functions, and aliases are virtual, and they are executed within the existing
shell process. These commands are mostly simple and lightweight.
For binary commands, the shell program is responsible for finding the actual binary
file from the file system that matches the command name. Don't expect the shell
to go looking everywhere on your machine for a command, though. Instead, the
shell relies on an environment variable called $PATH , which is a colon-delimited
(:) list of paths to iterate over. The first match is always chosen.
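You can print your own $PATH to see what it contains (the value below is illustrative; yours will differ):

$ echo $PATH
/home/juha/.pyenv/shims:/usr/local/bin:/usr/bin:/bin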
If you want to figure out where the binary file for a certain command is, you can
call the which command.
$ which python
/home/juha/.pyenv/shims/python
Now that you know where to find the file, you can use the file utility to figure
out the general type of the file.
$ file /home/juha/.pyenv/shims/pip
/home/juha/.pyenv/shims/pip: Bourne-Again shell script text executable, ASCII text

$ file /usr/bin/python3.9
/usr/bin/python3.9: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
Usually we do not execute our Python scripts as commands but use the interpreter
like this:
$ python hello.py
Hello world
Here python is the command, and hello.py is just a parameter for it. (If you
look at what python --help says, you can see it corresponds to the variation
"file: program read from script file", which really does make sense here.)
For this to work, we need two things. Firstly, the first line of hello.py needs to define a script interpreter using the special #! notation.
#!/usr/bin/env python3
print("Hello world")
The #! notation tells the operating system which program knows how to interpret the text in the file, and it has many cool nicknames like shebang, hashbang, or my absolute favorite, the hash-pling!
The second thing we need is for the file to be marked executable. You do that
with the chmod (change mode) command: chmod u+x hello.py will set
the eXecutable flag for the owning User.
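After that, the script can be executed directly:

$ ./hello.py
Hello world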
A builtin is a simple command hard-coded into the shell program itself. Commands
like cd , echo , alias , and pwd are usually builtins.
If you run the help command (which is also a builtin!), you'll get a list of all the
builtin commands.
If you want to list all the functions currently available, you can call (in Bash-like
shells):
$ declare -F
Aliases are like macros: a shorthand or an alternative name for a more complicated command. For example, say you want a new command showerr to list recent system errors:
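A sketch of such an alias (the journalctl command it wraps is illustrative):

alias showerr="journalctl --priority=err --since yesterday"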
Since functions and aliases are not physical files, they do not persist after closing
the terminal and are usually defined in the so-called profile file ~/.bash_profile
or the ~/.bashrc file, which are executed when a new interactive or login shell
is started. Some distributions also support a ~/.bash_aliases file (which is
likely invoked from the profile file -- it's scripts all the way down!).
If you want to get a list of all the aliases currently active for your shell, you can just
call the alias command without any parameters.
Combining commands together
Pretty much anything that happens on your computer happens inside processes.
Binary and script commands always start a new process. Builtins, functions, and
aliases piggyback on the existing shell program's process.
Every process has three standard streams: standard input (stdin), standard output (stdout), and standard error (stderr). What are these streams? They are simply arbitrary streams of data. No encoding is specified, which means it can be anything: text, video, audio, morse code, whatever the author of the command felt appropriate. Ultimately your computer is just a glorified data transformation machine. Thus it makes sense that every process has an input and output, just like functions do. It also makes sense to separate the output stream from the error stream. If your output stream is a video, then you don't want the bytes of the text-based error messages to get mixed with your video bytes (indeed, the standard error stream was implemented in the 1970s after error messages, typeset instead of shown on the terminal, ruined someone's phototypesetting).
By default, the stdout and stderr streams are piped back into your terminal, but
these streams can be redirected to files or piped to become an input of another
process. In the command line, this is done by using special redirection operators
( | , > , < , >> ).
Let's start with an example. The curl command downloads a URL and directs its standard output to the terminal by default.
$ curl https://filesamples.com/samples/document/csv/sample1.csv
"May", 0.1, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0
"Jun", 0.5, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 1
"Jul", 0.7, 5, 1, 1, 2, 0, 1, 3, 0, 2, 2, 1
"Aug", 2.3, 6, 3, 2, 4, 4, 4, 7, 8, 2, 2, 3
"Sep", 3.5, 6, 4, 7, 4, 2, 8, 5, 2, 5, 2, 5
Let's say we only want the first three rows. We can do this by piping two
commands together using the piping operator ( | ). The standard output of the
first command ( curl ) is piped as the standard input of the second ( head ). The
standard output of the second command ( head ) remains output to the terminal
as a default.
$ curl https://filesamples.com/samples/document/csv/sample1.csv | head -n 3
"May", 0.1, 0, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0
"Jun", 0.5, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 1
"Jul", 0.7, 5, 1, 1, 2, 0, 1, 3, 0, 2, 2, 1
Usually, you want data on the disk instead of your terminal. We can achieve this
by redirecting the standard output of the last command ( head ) into a file called
foo.csv using the > operator.
$ curl https://filesamples.com/samples/document/csv/sample1.csv | head -n 3 > foo.csv
Finally, a process always returns a value when it ends. When the return value is
zero (0), we interpret it as successful execution. If it returns any other number,
it means that the execution had an error and quit prematurely. For example, any
Python exception which is not caught by try/except has the Python interpreter
exit with a non-zero code.
You can check what the return value of the previously executed command was
using the $? variable.
$ curl http://fake-url
curl: (6) Could not resolve host: fake-url
$ echo $?
6
Previously we piped two commands together with streams, which means they
ran in parallel. The return value of a command is important when we combine
two commands together using the && operator. This means that we wait for the
previous command to succeed before moving on to the next. For example:
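A sketch (the destination paths are illustrative):

cp /tmp/apple /tmp/pears && cp /tmp/apple /tmp/oranges && rm /tmp/apple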
Here we try to copy the file /tmp/apple to two different locations and finally delete the original file. Using the && operator means that the shell program checks the return value of each command and asserts that it is zero (success) before moving on. This protects us from accidentally deleting the file at the end.
If you're interested in writing longer shell scripts, now is a good time to take a
small detour to the land of the Bash "strict mode" to save yourself from a lot of
headache.
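The commonly cited "unofficial strict mode" boils down to a few shell options at the top of your script:

#!/bin/bash
set -euo pipefail  # exit on errors, treat unset variables as errors, fail a pipeline if any command in it fails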
For automating and documenting these kinds of everyday scripts, we recommend using one of the classics, all the way from 1976: the make command. It is a simple, ubiquitous, and robust tool which was originally created for compiling source code but can be weaponized for executing and documenting arbitrary scripts.
The default way to use make is to create a text file called Makefile in the root directory of your project. You should always commit this file into your version control system.
Let's create a very simple Makefile with just one "target". They are called targets due to make's history with compiling source code, but you should think of a target as a task.
Makefile
hello:
	echo "Hello world!"
Now, remember we said this is a classic from 1976? Well, it's not without its quirks.
You have to be very careful to indent that echo statement with a tab character,
not any number of spaces. If you don't do that, you'll get a "missing separator"
error.
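Running the target looks like this:

$ make hello
echo "Hello world!"
Hello world!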
Notice how make also prints out the recipes and not just the output. You can limit
the output by using the -s parameter.
$ make -s hello
Hello world!
Next, let's add something useful like downloading our training data.
Makefile
hello:
	echo "Hello world!"

get-data:
	mkdir -p .data
	curl https://filesamples.com/samples/document/csv/sample1.csv > .data/sample1.csv
	echo "Downloaded .data/sample1.csv"
(Aside: The more seasoned Makefile wizards among our readership would note
that get-data should really be named .data/sample1.csv to take
advantage of Makefile's shorthands and data dependencies.)
Makefile
DOCKER_IMAGE := mycompany/myproject
VERSION := $(shell git describe --always --dirty --long)

default:
	echo "See readme"

init:
	pip install -r requirements.txt
	pip install -r requirements-dev.txt
	cp -u .env.template .env

build-image:
	docker build . \
		-f ./Dockerfile \
		-t $(DOCKER_IMAGE):$(VERSION)

push-image:
	docker push $(DOCKER_IMAGE):$(VERSION)

pin-dependencies:
	pip install -U pip-tools
	pip-compile requirements.in
	pip-compile requirements-dev.in

upgrade-dependencies:
	pip install -U pip pip-tools
	pip-compile -U requirements.in
	pip-compile -U requirements-dev.in
This example Makefile would allow your team members to initialize their
environment after cloning the repository, pin the dependencies when they
introduce new libraries, and deploy a new docker image with a nice version tag.
What every data scientist should know
about programming tools
There are many ways to give instructions to computers, but writing long text-
based recipes is one of the most challenging and versatile ways to command our
silicon-based colleagues. We call this approach programming, and most data
scientists accept that it is a part of their profession, but unfortunately, many
underestimate the importance of tooling for it.
The minimum tooling is a simple text editor and the ability to execute your programs. Most operating systems come with an editor (like Notepad in Windows) and the ability to run code (Mac & Linux ship with a C++ compiler). Programming in this minimalistic way went out of fashion in the 90s.
Notebooks (like Jupyter) are often the first contact with programming for any data
scientist. There is absolutely nothing wrong with notebooks, and they are fantastic
for many use-cases, but they are not the only option for writing programs. Too many
get stuck in the vanilla notebook and do not realize what they are missing out on.
There are many tools for writing, refactoring, navigating, debugging, analyzing, and profiling source code. Most tools are stitched together into a single program called an IDE (Integrated Development Environment), but some remain separate stand-alone programs. Most modern IDEs (like VSCode and PyCharm) also have a vibrant plugin ecosystem to extend the built-in capabilities, and the same can be said about the notebooks too.
Code Completion
The funny thing is that I always thought code completion was something that only IDEs do and Jupyter doesn't, but it does! Start writing some code in your notebook and press the TAB key. It's magic.
Code completion is a bit smoother in IDEs, though. There is no need to keep firing the TAB key, and the popups offer more context like method signatures, documentation, and tips.
Code completion with GitHub Copilot
The bottom line is that if you have never used code completion before, you
should start doing that today. It will change your life!
Refactoring
A modern IDE is context-aware and truly understands code. It knows what a method is, and the rename operation isn't just a dumb string replacement but safely and robustly renames all usages across the entire codebase.
Renaming a variable in PyCharm
Renaming a method or a variable is a classic, but there are dozens of useful little tools out there, like adding imports, extracting methods, auto-updating class initializers, and commenting out large chunks of code, to name a few.
If you want more inspiration, check out the documentation for PyCharm & VSCode
https://www.jetbrains.com/help/pycharm/refactoring-source-code.html
https://code.visualstudio.com/docs/editor/refactoring
Navigation
The great thing about using an IDE is that you can dive into the source code of any 3rd party package. Want to know what the filter() method in Pandas actually does under the hood? Just CTRL+click it and see the implementation for yourself! The source code for Pandas is not some next-level voodoo. It is vanilla Python code written by a flesh-and-blood programmer just like you. Don't be afraid to dive in!
Navigation tools are great at putting everything at your fingertips. Almost every IDE has a generic search tool, which is like having a Google search engine for your project. "What was the name of that method again?" and "I need to edit the Dockerfile now" are just a hotkey away from getting solved.
Learning all the hotkeys for navigation feels like a burden at first, but jumping
around in code becomes second nature once you have internalized them.
Navigation is one aspect where the notebooks are unfortunately quite lacking,
perhaps due to being designed for a single piece of code and not a large codebase.
Debugger
Let's face it, every piece of code out there has bugs, and when you are writing something new, your program is broken pretty much all of the time. Debugging is the act of finding out why the darned thing doesn't do what you expect. Someone once said that debugging is like being the detective in a crime movie where you are also the murderer.
The easy and obvious bugs are squashed just by staring at the code. There is
nothing wrong with that. If that doesn't work, the following approach is running
the program with some extra logging, which is fine too. But once we get into the
twilight zone of the more bizarre bugs, where nothing seems to make sense, you
want to get yourself a debugger.
A debugger is a tool that lets you run the program and inspect its execution like
you had one of those 10000 frames per second stop-motion cameras. You get
to run the program step-by-step, see the value of every variable, and follow the
execution down the rabbit hole of method calls as deep as you need to go. You
no longer need to guess what happens. The entire state of the program is at your
fingertips.
VSCode debugger inspecting a running program
As data scientists, we often run our production code in the cloud, and the most
bizarre bugs tend to thrive in these situations. When your production environment
(cloud) slightly deviates from your development environment (laptop), you are in
for some painful moments. It is where debuggers shine, as they let you debug
remotely and reliably compare the two environments.
Python ships with a built-in command-line debugger, pdb, and Jupyter lets you use it with the %debug magic, but we highly recommend using the visual debuggers in IDEs like PyCharm and VSCode. JupyterLab also has a visual debugger available as an extension (https://github.com/jupyterlab/debugger).
JupyterLab debugger extension
A debugger might be overkill for simple bugs, but the next time you find
yourself staring at the code for more than an hour, you might want to consider
trying out a debugger. You’d be surprised how much it changes your perspective.
Profiler
Often we start guessing blindly where the bottleneck is in our code. We might
even manually write some ad-hoc logging to time our method calls. Human
intuition can be pretty bad at this. We often end up micro-optimizing things that
make no difference at all. It is better than nothing, but to be completely honest,
you need a profiler.
A profiler is a tool that times everything and can also measure memory usage in
great detail. You run your program using a profiler, and you know exactly where
the processing power is spent and who hoarded the precious megabytes. Like the
debugger in the previous chapter, you no longer need to guess. All the information
is at your fingertips.
A flame graph visualizing the time spent in different parts of the program
In a typical data science crime scene, the murderer is a 3rd party library like Pandas. It is not that there is anything inherently wrong with these libraries, but they are optimized for ease of use rather than for making sure you get the best performance. Complicated things are hidden from you by design. The end result is code that works but is very slow. Profilers are an excellent tool for exposing this when needed. It is not uncommon to get a 100x speed-up by switching one Pandas method to another!
The best profilers are standalone programs or IDE plugins, but all is not lost in the notebook space. Jupyter notebooks have built-in magic commands like %time and %prun which can tell you a lot, but they are a bit lacking in user experience compared to their visual counterparts.
Profiling a cell in Jupyter notebook with %%prun
Conclusion
A professional lumberjack doesn't cut down a forest with a rusty old handsaw; they use a chainsaw because it gets the job done. In this regard, programming is pretty much like any other job. Programming in a vanilla notebook might be fine for small things, but engineering for production without proper tooling is not recommended, and the gap is widening every day in the wake of new AI-assisted programming tools. I hope this chapter has inspired data scientists to explore what is out there.
Final takeaways
MLOps is the term used for operating machine learning projects in production. It really comes down to running reproducible machine learning workloads with confidence. This eBook teaches the fundamentals of Git, Docker, Python dependencies, and Bash, all of which are requirements for getting your MLOps right and doing pioneering machine learning.
There is more to MLOps than these, though. The chapters in this book are just
building blocks. It is not enough to build robust Docker images, operate clean git
repositories, and master the command line. All those images, repositories, and
scripts need to come together. Something needs to glue these things to create a
meaningful whole.
Build in Production
The infrastructure stack for machine learning development
You don’t need to bridge the gap between ML and Ops when every model and pipeline runs
on Valohai. Our platform brings powerful cloud infrastructure to every data scientist and
ML engineer and unlocks their full creativity in building machine learning solutions.
DEVELOPER-FIRST MLOPS
Ship faster and more frequently – Experimentation and productization on the same platform makes shipping easy.

Empower the whole team – Valohai democratizes the access to experiments, models, metrics, and compute.

Build without arbitrary limitations – Every production system is different, which is why Valohai can integrate with any system.