3 Machine Learning Tools
Quebec Artificial Intelligence Institute
Jeremy Pinto
Applied Research Scientist, Mila
[email protected]
Contents
● Python
● Machine Learning Tools
● Deep Learning Frameworks
● Project Management
● Hardware
Python
Python is a programming language first released in 1991. It is open source and very
popular in the machine learning and deep learning communities.
Comparable languages: (shown as logos on the slide)
https://www.python.org/
Python
https://www.python.org/dev/peps/pep-0020/
https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds
Python Libraries
● Machine Learning
● Scientific Computing
● Data Visualisation
● Deep Learning
● Data Analysis
● Coding Environments
● Library Management
Interactive Notebooks
Interactive notebooks let you run Python code in cells and view the generated
output directly inline. They also support Markdown for displaying formatted text. They are
very useful for generating graphs, displaying results, sharing ideas, etc. The most
common notebook environment is Jupyter Notebook.
https://jupyter.org/
Colab
An even easier way to get started with Python is Google Colab, an interactive
Python notebook that runs directly in your browser. It comes pre-configured with
Python 3 and popular scientific computing and deep learning libraries (NumPy, PyTorch,
TensorFlow, etc.), and it can even give you access to a free GPU (K80). We will use it
for our tutorials.
Python 101
Running a hello world program in Colab is as easy as typing a single print statement into a cell and running it.
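A minimal sketch of what such a cell might contain. The slide's screenshot is not available, so the exact code and the variable name `message` are assumptions:

```python
# A one-line "hello world", as it could be typed into a Colab or Jupyter cell.
message = "Hello, world!"
print(message)
```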
Machine Learning Tools
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Pipeline
This is what our pipeline will look like:
Explore the raw data → Import the data to Python → Process the data → Make predictions using machine learning
Pipeline
We begin by exploring the raw data with tools we are familiar with.
Data Exploration
The data is a spreadsheet (.csv) in which each row holds various measurements and one of
two diagnoses: malignant or benign. Predicting the diagnosis from the measurements is a
binary classification problem.
Pipeline
Now that we know what kind of data to expect, we can import it into Python using the
pandas library.
Data Import
We use pandas to import the CSV file and quickly inspect useful information such as
the number of samples and how they are distributed. Pandas lets us manipulate the data
easily while preserving its tabular form.
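As a rough sketch of this step, the snippet below loads a tiny in-memory stand-in for the CSV file and prints a few summaries. The rows are made up for illustration, and the column names (`diagnosis`, `radius_mean`, `texture_mean`) are assumptions about the file's layout:

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (values are made up for illustration).
csv_text = """id,diagnosis,radius_mean,texture_mean
1,M,17.9,10.4
2,M,20.6,17.8
3,B,13.5,14.4
4,B,13.1,15.7
5,B,9.5,12.4
"""

# With the real file on disk, pass its path to pd.read_csv instead.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)                        # (number of rows, number of columns)
print(df.head())                       # first rows, still in tabular form
print(df["diagnosis"].value_counts())  # number of samples per diagnosis
print(df.describe())                   # summary statistics per numeric column
```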
Pipeline
Let’s start looking at our data visually.
Matplotlib
We can use matplotlib to draw a pie chart. So far, this is something a spreadsheet
could do just as easily, so writing code for it may seem overly complicated.
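A minimal sketch of a matplotlib pie chart, assuming made-up class counts rather than the real dataset. The `Agg` backend and the in-memory buffer are only there so the snippet runs outside a notebook:

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Made-up counts for illustration; in practice they would come from the data.
labels = ["Benign", "Malignant"]
counts = [3, 2]

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(counts, labels=labels, autopct="%1.0f%%")
ax.set_title("Diagnosis distribution")

# Save the figure to an in-memory PNG (a notebook would display it inline).
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```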
Seaborn
Libraries like seaborn build on matplotlib and pandas to provide a high-level
interface that produces rich, informative plots in a few lines of code.
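The following sketch illustrates the idea with a single seaborn call on a small, made-up DataFrame. The column names are assumptions, and the choice of a box plot is only one of many possible plots:

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Made-up measurements for illustration only.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B", "B"],
        "radius_mean": [17.9, 20.6, 13.5, 13.1, 9.5],
    }
)

# One call groups the rows by diagnosis and draws one box per group.
ax = sns.boxplot(data=df, x="diagnosis", y="radius_mean")
ax.set_title("Mean radius by diagnosis")
```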
Pipeline
Next we will convert our data to NumPy arrays.
NumPy
NumPy allows us to manipulate N-dimensional arrays in Python. The NumPy array is a
ubiquitous data structure supported by most Python libraries that manipulate data.
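A short sketch of the conversion from a pandas DataFrame to NumPy arrays. The DataFrame contents, the column names and the 0/1 label encoding are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the imported dataset.
df = pd.DataFrame(
    {
        "diagnosis": ["M", "M", "B", "B", "B"],
        "radius_mean": [17.9, 20.6, 13.5, 13.1, 9.5],
        "texture_mean": [10.4, 17.8, 14.4, 15.7, 12.4],
    }
)

# Feature matrix: one row per sample, one column per measurement.
X = df[["radius_mean", "texture_mean"]].to_numpy()
# Label vector: 1 for malignant, 0 for benign.
y = (df["diagnosis"] == "M").to_numpy().astype(int)

print(X.shape, y.shape)
print(X.mean(axis=0))  # column-wise mean, computed without a Python loop
```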
Pipeline
Next we will use the scikit-learn API to train a machine learning model on our data.
Scikit-Learn
Scikit-learn is a machine learning library for Python that contains many different types
of algorithms behind a standardized API: an estimator is created, fit to data, and then
used to predict.
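The create/fit/predict pattern can be sketched as follows. The toy data and the choice of a nearest-neighbour classifier are assumptions for illustration; most scikit-learn estimators follow the same sequence of calls:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy one-dimensional data: small values are class 0, large values class 1.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=1)  # 1. create an estimator
model.fit(X_train, y_train)                  # 2. fit it to training data
predictions = model.predict([[0.2], [2.8]])  # 3. predict on new samples
accuracy = model.score(X_train, y_train)     # 4. evaluate (mean accuracy)

print(predictions, accuracy)
```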
Scikit-Learn
Let’s look at an example of L2-regularized logistic regression on our dataset:
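The slide's code is not available, so the following is a sketch under assumptions: it uses the copy of the Wisconsin breast cancer data bundled with scikit-learn (`load_breast_cancer`) instead of the Kaggle CSV, and it standardizes the features first so the solver converges easily. As in the slides, the model is scored on the same data it was trained on:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# LogisticRegression applies L2 regularization by default; C is the inverse
# of the regularization strength (smaller C means stronger regularization).
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
model.fit(X, y)

train_accuracy = model.score(X, y)
print(f"Accuracy on the training data: {train_accuracy:.3f}")
```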
Scikit-Learn
In the previous example we trained and evaluated on the same data, which gives an
overly optimistic estimate. Scikit-learn provides utilities to split the data into train
and test sets and to run different cross-validation strategies.
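A sketch of two common strategies, a single hold-out split and 5-fold cross-validation. The 80/20 split, the number of folds and the use of scikit-learn's bundled copy of the dataset are assumptions, not choices taken from the slides:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# Hold-out split: fit on 80% of the samples, score on the remaining 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validation: every sample is used for testing exactly once.
fold_scores = cross_val_score(model, X, y, cv=5)

print(f"Hold-out accuracy: {test_accuracy:.3f}")
print(f"Cross-validation accuracy: {fold_scores.mean():.3f}")
```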
Deep Learning Frameworks
https://www.tensorflow.org/guide/graphs
History of Deep Learning Frameworks
[Figure: timeline of deep learning frameworks and their creators, with entries at 2002 (Torch), 2007, 2008, 2013, 2014, 2015, 2016, 2017, 2018 and 2019; the version labels 2.0 (RC) and 1.2 also appear]
PyTorch vs. TensorFlow
[Figure: framework usage in papers submitted to ICLR]
Source: https://www.reddit.com/r/MachineLearning/comments/9kys38/r_frameworks_mentioned_iclr_20182019_tensorflow/
PyTorch vs. TensorFlow
Computational graph:
● PyTorch: dynamic (a static graph is available through TorchScript)
● TensorFlow: static (eager mode is the default in 2.0)
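To illustrate what a dynamic graph means in practice, here is a small PyTorch sketch in which ordinary Python control flow decides which operations are recorded. The function `forward` and the tensors are hypothetical examples, not code from the slides:

```python
import torch


def forward(x):
    # Ordinary Python control flow decides which operations are recorded,
    # so the graph can differ from one call to the next.
    if x.sum() > 0:
        return (x ** 2).sum()
    return (-x).sum()


x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = forward(x)      # the graph is built while this line runs
loss.backward()        # gradients flow through the branch that was taken
print(x.grad)          # gradient of sum(x^2) is 2x

z = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(z).backward()  # this call takes the other branch
print(z.grad)          # gradient of sum(-z) is -1 for every element
```

In a static-graph framework the branch would instead have to be expressed with graph-level control flow operations before any data is seen.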
Wrapper APIs
Some libraries offer convenience wrapper functions to make deep learning libraries
even more user-friendly and easy to get started with.
https://williamfalcon.github.io/pytorch-lightning/
https://www.tensorflow.org/guide/keras
https://www.fast.ai/
Project Management
Managing Projects
Colab and Jupyter notebooks are great tools for quick iteration, data visualization
and sharing ideas. They are not ideal for scaling and deploying large projects and
can get cluttered very quickly. Here are some tips for good Python code and projects:
Organized Codebase
● Organize your code in a logical, hierarchical structure
● Use meaningful names for variables, files, folders, etc.
● Avoid code duplication
○ Use the same training and data loading routines across experiments
○ Use object-oriented programming paradigms
■ https://realpython.com/python3-object-oriented-programming/
● Follow the PEP guidelines (lint your code, think “pythonic”, etc.)
○ https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds
○ http://flake8.pycqa.org/en/latest/#
● Document your code
○ https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
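As an illustration of the naming and documentation tips, here is a sketch of a small function with a NumPy-style docstring. The function `accuracy` is a hypothetical example, not part of any library mentioned above:

```python
def accuracy(predictions, targets):
    """Compute the fraction of predictions that match the targets.

    Parameters
    ----------
    predictions : list of int
        Predicted class labels.
    targets : list of int
        True class labels, in the same order as ``predictions``.

    Returns
    -------
    float
        Fraction of matching labels, between 0.0 and 1.0.

    Raises
    ------
    ValueError
        If the inputs are empty or differ in length.
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```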
Version Control Systems
Version Control Systems (VCS) allow users to keep track of every change made to a
project. They are great for working collaboratively on large codebases. A
very popular (and open source) VCS is Git.
Git focuses on the “branching” paradigm, where changes can be added and merged
in a controlled and logical fashion. Each team member works on their own version of
the code and can propose to “merge” their changes when new features are ready. Git
combines changes that do not overlap automatically; edits that conflict must be
resolved by hand.
https://www.atlassian.com/git/tutorials/what-is-version-control
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
https://medium.com/gradeup/version-control-system-get-up-to-speed-with-git-ea25b5cb7329
Virtual Environments
Since Python projects rely on external libraries, versions matter. Libraries might get
updated with breaking changes, which can be dangerous for legacy code. The solution
is to isolate each project in its own virtual environment. Even the Python version
can have an impact on code execution.
https://realpython.com/effective-python-environment/
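One way to create such an environment is Python's built-in `venv` module. The sketch below drives it from Python; the temporary project folder and the `.venv` directory name are assumptions for illustration, and the same result is usually obtained from the command line with `python -m venv .venv`:

```python
import pathlib
import tempfile
import venv

# Hypothetical project folder; a temporary directory keeps the sketch tidy.
project_dir = pathlib.Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# with_pip=False keeps creation fast; set it to True to be able to install
# packages into the environment with pip.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg records which Python installation the environment is based on.
config = (env_dir / "pyvenv.cfg").read_text()
print(config)
```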
Unit tests
It is good practice to write unit tests that your code must pass every time they are
run. This is a great way to catch bugs early and to state explicitly how the code is
expected to behave.
https://docs.pytest.org/en/latest/
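A minimal sketch of what pytest-style tests can look like. The function `normalize` and both tests are hypothetical examples; the tests use plain `assert` statements so the snippet also runs without pytest installed:

```python
def normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("cannot normalize constant values")
    return [(v - low) / (high - low) for v in values]


def test_normalize_range():
    result = normalize([2.0, 4.0, 6.0])
    assert result == [0.0, 0.5, 1.0]


def test_normalize_rejects_constant_input():
    try:
        normalize([3.0, 3.0])
    except ValueError:
        return
    raise AssertionError("expected a ValueError")


# pytest would discover and run the test_* functions automatically;
# calling them directly keeps this sketch runnable on its own.
test_normalize_range()
test_normalize_rejects_constant_input()
```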
Experiment Management
Deep learning is inherently empirical. When solving a task, you will want to try many
variations of a network, e.g. different hyperparameters, models, preprocessing, etc.
You will therefore need tools to organize and monitor your experiments.
https://www.tensorflow.org/guide/summaries_and_tensorboard
https://mlflow.org/docs/latest/tutorial.html
Hyperparameter Optimization
Manually tweaking hyperparameters can be exhausting and time-consuming. Many
libraries can automatically suggest new hyperparameter values based on the results
observed so far.
[Figure: a loop in which your model reports observed performances to a hyperparameter optimizer, which returns suggested hyperparameters]
http://www.asimovinstitute.org/neural-network-zoo/
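The suggest-and-observe loop can be sketched with plain random search, which is a deliberately simple stand-in for the smarter strategies that dedicated libraries offer. The function `validation_score`, the hyperparameter names and their ranges are all hypothetical:

```python
import random


def validation_score(learning_rate, hidden_units):
    """Stand-in for training a model and measuring its performance.

    A real version would train the model and return a validation metric;
    this toy function simply peaks at learning_rate=0.01, hidden_units=64.
    """
    return -abs(learning_rate - 0.01) * 100 - abs(hidden_units - 64) / 64


rng = random.Random(0)
best_score, best_params = float("-inf"), None

for trial in range(50):
    # The "optimizer" suggests hyperparameters (here: sampled at random).
    params = {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "hidden_units": rng.choice([16, 32, 64, 128, 256]),
    }
    # The model reports its observed performance back.
    score = validation_score(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```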
Experiment Management Libraries
[Figure: logos of experiment management libraries]
Hardware
A computer typically contains a CPU and often a GPU. Each was designed for a different
type of workload. The CPU handles a small number of operations quickly and
sequentially, while the GPU handles many operations in parallel, although each
individual operation can be slower.
GPU vs. CPU
When training deep learning models, many operations are independent of one another and
can be massively parallelized. Computations can be done in batches and the results
aggregated, which makes training much faster.
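The idea of batching can be sketched with NumPy on the CPU: applying one linear layer to each sample separately gives the same result as a single matrix product over the whole batch. The shapes and random values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 100))   # a batch of 256 samples
weights = rng.normal(size=(100, 10))   # one linear layer

# One sample at a time: 256 independent matrix-vector products.
one_by_one = np.stack([sample @ weights for sample in inputs])

# Whole batch at once: a single matrix-matrix product that optimized
# libraries (and GPUs) can parallelize internally.
batched = inputs @ weights

print(np.allclose(one_by_one, batched))
```

Because the per-sample products do not depend on each other, hardware with many parallel units can compute them simultaneously.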
CPU vs. GPU
GPUs were originally developed for rendering graphics, driven largely by the gaming
industry. The term was popularized around the time of the original PlayStation. GPUs are
optimized for parallel computation and happen to be very efficient at matrix and vector
operations, which makes them well suited to machine learning and deep learning.
Modern deep learning frameworks use highly optimized code to parallelize
computations automatically on the GPU.
CPU vs. GPU vs. TPU
As deep learning has become more prevalent, chips have become increasingly
specialized for it. Google has developed its own Tensor Processing Unit (TPU)
to rival the GPU.
https://arxiv.org/pdf/1907.10701.pdf
Performance
Training deep learning models involves more than just matrix and tensor operations.
Data has to be preprocessed before the GPU can process it. Most modern frameworks
can preprocess the next batch of data on the CPU while the GPU computes results on
the current batch, which maximizes resource utilization.
https://www.tensorflow.org/beta/guide/data_performance
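The overlap between preprocessing and computation can be sketched with a background thread and a bounded queue from the standard library. This is a conceptual illustration only: `preprocess` and `compute` are hypothetical stand-ins, and real frameworks typically use worker processes or native threads rather than this exact mechanism:

```python
import queue
import threading


def preprocess(raw_batch):
    """Stand-in for CPU-side preprocessing (decoding, augmentation, ...)."""
    return [value / 255.0 for value in raw_batch]


def compute(batch):
    """Stand-in for the GPU-side forward/backward pass."""
    return sum(batch)


def producer(raw_batches, prefetch_queue):
    # Runs in a background thread: prepares upcoming batches while the
    # main thread is still busy computing on an earlier one.
    for raw_batch in raw_batches:
        prefetch_queue.put(preprocess(raw_batch))
    prefetch_queue.put(None)  # sentinel: no more data


raw_batches = [[0, 255], [255, 255], [0, 0]]
prefetch_queue = queue.Queue(maxsize=2)  # at most 2 batches prepared ahead
threading.Thread(
    target=producer, args=(raw_batches, prefetch_queue), daemon=True
).start()

results = []
while (batch := prefetch_queue.get()) is not None:
    results.append(compute(batch))

print(results)
```

The queue size plays the role of a prefetch buffer: it bounds how far preprocessing may run ahead of computation.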
Edge Computing
While GPUs and TPUs speed up batch computations, CPUs can still be very fast and
cost-effective for inference. It can also be hard to guarantee access to a GPU when
doing edge computing, i.e. on-device computation.
Cloud Computing
Deploying computations on the cloud has pros and cons.
Pros:
● Get access to the latest hardware
● Low initial cost
● Scale up and down based on demand
Cons:
● Can be prohibitively expensive when running 24/7
Cloud Computing Cost + Hardware
https://dawn.cs.stanford.edu/benchmark/
https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a
Cloud Computing Pipeline
[Figure: cloud computing pipeline with boxes labeled Prototype, Deploy, Train, Preprocess, Evaluate, Validate, RESTful API and Database]
Questions?