
Quebec Artificial Intelligence Institute

Machine Learning Tools

Jeremy Pinto
Applied Research Scientist, Mila
[email protected]
Contents
● Python
● Machine Learning Tools
● Deep Learning Frameworks
● Project Management
● Hardware

https://www.python.org/

2
Python
Python is a programming language first released in 1991. It is open-source and very
popular in the machine learning and deep learning community.

Why is Python great?

● The Zen of Python
● Rich ecosystem of libraries
● Easy to get started with (interactive notebooks)

Comparable languages:
https://www.python.org/

4
Python

“Code is read much more often than it is written” - Guido van Rossum

https://www.python.org/dev/peps/pep-0020/
https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds

6
Python Libraries

● Machine Learning
● Scientific Computing
● Data Visualisation
● Deep Learning
● Data Analysis
● Coding Environments
● Library Management

13
Interactive Notebooks
Interactive notebooks are a way to run Python code in cells and view the generated
output directly inline. They also support markdown for displaying inline text. They are
very useful for generating graphs, displaying results, sharing ideas, etc. The most
common notebook environment is the Jupyter Notebook.

https://jupyter.org/

15
Colab
An even easier way to get started with Python is Google Colab, an interactive
Python notebook that runs directly in your browser. It comes pre-configured with
Python 3 and popular scientific computing libraries (numpy, pytorch, tensorflow, etc.),
and you even get access to a free GPU (K80). We will use it for our tutorials.

16
Python 101
Running a hello world program with Colab is literally this easy:
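The screenshot from the original slide is not reproduced here; an equivalent one-line cell is simply:

    print("Hello, world!")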

17
Machine Learning 101
Let’s take a look at a standard machine learning pipeline using Python. We will use
the Breast Cancer Wisconsin Data Set, which labels measurements of tumors as
malignant or benign.

“Features are computed from a digitized image of a fine needle aspirate (FNA) of a
breast mass. They describe characteristics of the cell nuclei present in the image.”

https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

19
Pipeline
This is what our pipeline will look like:

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

20
Pipeline
We begin by exploring the data with tools we are familiar with:

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

21
Data Exploration
The data is a spreadsheet (.csv) with various measurements and two diagnoses:
Malignant and Benign. This is a binary classification problem.

22
Pipeline
Now that we know what kind of data to expect, we can import it into Python using
the pandas library.

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

23
Data Import
We use pandas to import the csv file and
quickly explore some useful information
such as the data quantity and distribution.
Pandas will allow us to easily manipulate
our data structure and preserve it in its
tabular form.

Link to the example
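A minimal sketch of this step (assuming the Kaggle csv is saved as data.csv and, as on Kaggle, contains a diagnosis column):

    import pandas as pd

    df = pd.read_csv("data.csv")            # import the csv file
    print(df.shape)                         # data quantity: (rows, columns)
    print(df["diagnosis"].value_counts())   # class distribution: M vs. B
    df.describe()                           # summary statistics per column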

24
Pipeline
Let’s start looking at our data visually.

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

25
Matplotlib
We can use matplotlib to draw a pie
chart. So far this seems like
something that could just as easily be
done in a spreadsheet, and perhaps
with less effort.

Link to the example
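A sketch of the pie chart (the diagnosis column name is assumed from the dataset):

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("data.csv")
    counts = df["diagnosis"].value_counts()
    plt.pie(counts, labels=counts.index, autopct="%1.1f%%")  # share of M vs. B
    plt.title("Diagnosis distribution")
    plt.show()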

26
Seaborn
Libraries like seaborn combine the rich
features of matplotlib with pandas to
provide an interface that can produce
rich visualizations in a few
lines of code.

Link to the example
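For example, a single pairplot call (feature names assumed from the dataset) draws per-class scatter plots and distributions:

    import seaborn as sns
    import pandas as pd

    df = pd.read_csv("data.csv")
    # One line: pairwise scatter plots and histograms, colored by diagnosis
    sns.pairplot(df[["radius_mean", "texture_mean", "diagnosis"]], hue="diagnosis")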

27
Pipeline
Next we will convert our data to Numpy arrays.

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

29
Numpy
Numpy allows us to manipulate
N-dimensional arrays in Python. It is a
ubiquitous data structure supported by
most Python libraries that manipulate
data.

Here we convert our pandas data to
numpy arrays to be compatible with the
scikit-learn API.

Link to the example
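A sketch of the conversion (the id and diagnosis column names are assumed from the Kaggle csv):

    import pandas as pd

    df = pd.read_csv("data.csv")
    # Keep only the measurement columns as features
    X = df.drop(columns=["id", "diagnosis"]).to_numpy()
    # Encode the labels: malignant = 1, benign = 0
    y = (df["diagnosis"] == "M").to_numpy().astype(int)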

30
Pipeline
Next we will use the scikit-learn API to train a machine learning model on our data.

[Pipeline: Explore the raw data → Import the data to Python → Visually explore the data → Process the data → Make predictions using machine learning]

31
Scikit-Learn
Scikit-Learn is a machine learning library for Python that contains many different types
of algorithms behind a standardized API. It typically looks like this:
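The screenshot is not reproduced here; sketched with the X and y arrays built earlier, the pattern is:

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()   # 1. instantiate an estimator (any algorithm)
    model.fit(X, y)                # 2. learn from the data
    y_pred = model.predict(X)      # 3. predict labels
    accuracy = model.score(X, y)   # 4. evaluate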

32
Scikit-Learn
Let’s look at an example of L2-regularized logistic regression on our dataset:

Link to the example
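A sketch of what the linked example does (hyperparameter values are illustrative):

    from sklearn.linear_model import LogisticRegression

    # penalty="l2" is the scikit-learn default; C is the inverse
    # regularization strength (smaller C = stronger regularization)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=10000)
    clf.fit(X, y)
    print(clf.score(X, y))  # accuracy on the same data we trained on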


https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

33
Scikit-Learn
In the previous example we trained and evaluated on the same data. We can use
scikit-learn to implement different cross-validation strategies on train and test sets.

Link to the example
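A sketch of both strategies, again with the X and y arrays from earlier:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    # Hold out 20% of the data as a test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
    print(clf.score(X_test, y_test))

    # Or: 5-fold cross-validation over the whole dataset
    scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, cv=5)
    print(scores.mean())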

34
Deep Learning Frameworks
A good deep learning framework allows users to
easily implement neural networks and facilitates
deployment to production. These libraries are built
around the idea of computational graphs, which
summarize the operations of a network and make it
possible to optimize steps like backpropagation and
gradient descent.

https://www.tensorflow.org/guide/graphs
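A tiny illustration of the idea, sketched in PyTorch (other frameworks work similarly): the forward pass records a computational graph, and backpropagation traverses it.

    import torch

    x = torch.tensor([2.0], requires_grad=True)
    y = x ** 2 + 3 * x   # forward pass builds the graph
    y.backward()         # backpropagation through the graph
    print(x.grad)        # dy/dx = 2x + 3 = 7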
36
History of Deep Learning Frameworks

[Timeline, 2002–2019: deep learning frameworks from Torch (2002) through TensorFlow 2.0 (RC) and PyTorch 1.2, each labeled with its creator.]

2017/11/15: Release of Theano 1.0.0


Mila stopped developing Theano after 2017.
Theano was the precursor of many important
ideas in the frameworks that followed.

37
Pytorch vs. Tensorflow
Framework usage in papers submitted to ICLR

source:
https://www.reddit.com/r/MachineLearning/comments/9kys38/r_frameworks_mentioned_iclr_20182019_tensorflow/

39
Pytorch vs. Tensorflow

                      PyTorch                             TensorFlow
Open Source           Yes (BSD)                           Yes (Apache 2.0)
TPU + GPU + CUDA +
CUDNN Support         Yes                                 Yes
Visualization         Tensorboard, TensorboardX,          Tensorboard, TensorboardX,
                      Visdom                              Visdom
“Pythonic”?           “First-class Python integration”,   Harder debugging (but Eager
                      easy debugging, numpy-style         Mode is default in 2.0)
Production Ready      TorchScript (torch.jit) or          tensorflow.js, tensorflowlite,
                      converters                          tf.serving
Pre-trained Models    PyTorch Hub                         TensorFlow Hub
Computational Graph   Dynamic                             Static
                      (TorchScript static available)      (Eager Mode default in 2.0)
40
Wrapper APIs
Some libraries offer convenience wrapper functions to make deep learning libraries
even more user-friendly and easy to get started with.

“Like most things, API design is not complicated, it just involves
following a few basic rules. They all derive from a founding principle:
you should care about your users. All of them. Not just the smart
ones, not just the experts. Keep the user in focus at all times. Yes,
including those befuddled first-time users with limited context and little
patience. Every design decision should be made with the user in
mind.”
- Francois Chollet, Keras author

https://williamfalcon.github.io/pytorch-lightning/
https://www.tensorflow.org/guide/keras
https://www.fast.ai/
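As a sketch of what such a wrapper looks like, here is a tiny Keras model (the data and layer sizes are illustrative):

    import numpy as np
    import tensorflow as tf

    # Toy data: 100 samples, 30 features
    X = np.random.rand(100, 30).astype("float32")
    y = np.random.randint(0, 2, size=(100,))

    # A full binary classifier in a few lines with the Keras API
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(30,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=5)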
41
Project Management
Managing Projects
Colab and Jupyter notebooks are great tools for quick iteration, data visualization
and sharing ideas, but they are not ideal for scaling and deploying large projects and
can get cluttered very quickly. Here are some tips for good Python code and projects:

● Maintain an organized codebase


● Use Version Control for your code (Git)
● Use virtual environments
● Use unit tests

43
Organized Codebase
● Organize your code in a logical hierarchical structure
● Use logical names for variables, filenames, folders etc.
● Avoid code duplication
○ Have the same training/data loading routines regardless of
experiments
○ Use object oriented programming paradigms
■ https://realpython.com/python3-object-oriented-programming/
● Follow PEP guidelines (lint your code, think “pythonic”, etc.)
○ https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds
○ http://flake8.pycqa.org/en/latest/#
● Document your code (e.g. numpy-style docstrings, sketched below)
https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html
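A minimal numpy-style docstring in the format described by the napoleon link above (the function itself is just an illustration):

    import math

    def circle_area(radius):
        """Compute the area of a circle.

        Parameters
        ----------
        radius : float
            Radius of the circle.

        Returns
        -------
        float
            Area of the circle, pi * radius ** 2.
        """
        return math.pi * radius ** 2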

45
Version Control Systems
Version Control Systems (VCS) allow users to keep track of every change that ever
happened in a project. They are great for working collaboratively on large codebases. A
very popular (and open-source) VCS is git.

47
Version Control Systems
Git focuses on the “branching” paradigm, where changes can be added and merged
in a controlled and logical fashion. Each team member works on their own version of
the code and can propose or “merge” their changes when new features are ready. Git
handles (most) conflicts automatically.

https://www.atlassian.com/git/tutorials/what-is-version-control
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
https://medium.com/gradeup/version-control-system-get-up-to-speed-with-git-ea25b5cb7329

48
Virtual Environments
Since Python relies on external libraries, versions matter: libraries may be updated
with breaking changes, which is dangerous for legacy code. The solution is to
isolate each project in its own virtual environment. Even the Python version itself
can affect how code runs.

https://realpython.com/effective-python-environment/
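The usual workflow is from the command line (python -m venv env, then activate it and pip install dependencies); as a minimal sketch, the standard library can also create one programmatically:

    import venv

    # Equivalent to `python -m venv env` on the command line
    venv.create("env", with_pip=True)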

50
Unit tests
It is good practice to write unit tests that you expect your code to pass every time you
run it. This is a great way to catch bugs early and ensure that code works as
expected in an explicit way.

https://docs.pytest.org/en/latest/
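A minimal pytest sketch (normalize is a hypothetical helper; save as test_preprocessing.py and run pytest):

    import numpy as np

    def normalize(x):
        return (x - x.mean()) / x.std()

    def test_normalize_zero_mean_unit_std():
        z = normalize(np.array([1.0, 2.0, 3.0, 4.0]))
        assert abs(z.mean()) < 1e-8
        assert abs(z.std() - 1.0) < 1e-8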

52
Experiment Management
Deep learning is inherently empirical. When solving a task, you will want to try many
variations of a network, i.e. use different hyperparameters, models, processing, etc.
You will therefore need tools to organize and monitor your experiments.

https://www.tensorflow.org/guide/summaries_and_tensorboard
https://mlflow.org/docs/latest/tutorial.html
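A sketch of experiment tracking with MLflow (the parameter and metric values are illustrative):

    import mlflow

    with mlflow.start_run():
        mlflow.log_param("model", "logistic_regression")
        mlflow.log_param("C", 1.0)
        mlflow.log_metric("test_accuracy", 0.95)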

53
Hyperparameter Optimization
Manually tweaking parameters can be exhausting and time-consuming. Many
libraries can also automatically suggest new hyperparameter values
based on empirical results.
[Diagram: your model reports observed performances to a hyperparameter optimizer, which sends suggested hyperparameters back to the model.]

http://www.asimovinstitute.org/neural-network-zoo/
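The simplest version of this loop is an exhaustive grid search, sketched here with scikit-learn (libraries like scikit-optimize instead suggest values sequentially from past results):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = load_breast_cancer(return_X_y=True)
    # Try every candidate C and keep the best model by cross-validation
    grid = GridSearchCV(LogisticRegression(max_iter=10000),
                        param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)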
54
Experiment Management Libraries

Name              Experiment   Experiment              Hyperparameter   Open Source
                  Monitoring   Organizing              Optimization
Comet.ml          Yes          Yes                     Yes              Freemium
Trains            Yes          Yes                     No               Yes
MLFlow            Yes          Yes                     No               Yes
Tensorboard       Yes          No (limited)            No               Yes
Visdom            Yes          No (limited)            No               Yes
scikit-optimize   No           No                      Yes              Yes
AX.dev            No           Yes (limited, no GUI)   Yes              Yes
Orion             No           No                      Yes              Yes

55
Hardware
A computer typically contains both a CPU and a GPU, each designed for different
types of operations. The CPU is designed to handle a small number of
operations rapidly and sequentially, while the GPU can handle many operations in
parallel, each of which may be slower.

57
GPU vs. CPU
When training deep learning models, most operations are independent and can be
massively parallelized. Computations can be done in batches and their results
aggregated, which is much faster.

59
CPU vs. GPU
GPUs were originally developed for the gaming industry; the term first appeared with
the emergence of the PlayStation 1. They are optimized to perform parallel
computations, and happen to be very efficient at matrix and vector
operations in parallel, which makes them very well suited to machine learning and
deep learning.

60
CPU vs. GPU

61
CPU vs. GPU
Modern deep learning frameworks use highly optimized code to parallelize
computations automatically on the GPU.
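A sketch in PyTorch: a single line launches a massively parallel matrix multiplication on the GPU when one is available.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b  # one call, thousands of parallel GPU threads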

62
CPU vs. GPU vs. TPU
As deep learning becomes more prevalent, chips
have become increasingly specialized for it.
Google has developed its own Tensor
Processing Unit (TPU) to rival the GPU.

https://arxiv.org/pdf/1907.10701.pdf

63
Performance
Training deep learning models involves more than just matrix and tensor operations:
data has to be preprocessed before the GPU can consume it. To maximize resource
utilization, most modern frameworks let the CPU preprocess the next batch of data
while the GPU computes results on the previous batch.

https://www.tensorflow.org/beta/guide/data_performance
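A sketch with the tf.data API from the link above (features and preprocess are hypothetical placeholders):

    import tensorflow as tf

    dataset = tf.data.Dataset.from_tensor_slices(features)
    dataset = dataset.map(preprocess,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.batch(32)
    # CPU prepares the next batch while the GPU trains on the current one
    dataset = dataset.prefetch(1)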

64
Edge Computing
While GPUs and TPUs speed up batch computations, CPUs can still be very fast
and cost-effective for inference. It can also be hard to
guarantee access to GPUs when doing edge computing, i.e., on-device computation.

65
Cloud Computing
Deploying computations on the cloud has pros and cons.

Pros:
● Get access to the latest hardware
● Low initial cost
● Scale up and down based on demand

Cons:
● Can be prohibitively expensive when running 24/7

66
Cloud Computing Cost + Hardware

https://dawn.cs.stanford.edu/benchmark/
https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a

67
Cloud Computing Pipeline

[Diagram: a cloud computing pipeline with stages Prototype, Preprocess, Train, Evaluate, Validate, and Deploy, connected to a Database and exposed through a RESTful API.]
68
Questions?
