

GPU Programming made Easy


Frédéric Bastien
Laboratoire d’Informatique des Systèmes Adaptatifs
Département d’informatique et de recherche opérationnelle
James Bergstra, Olivier Breuleux, Frederic Bastien,
Arnaud Bergeron, Yoshua Bengio, Thierry Bertin-Mahieux, Josh Bleecher Snyder, Olivier
Delalleau, Guillaume Desjardins, Douglas Eck, Dumitru Erhan, Xavier Glorot, Ian Goodfellow,
Philippe Hamel, Pascal Lamblin, Simon Lemieux, Michael Mandel, Razvan Pascanu, François
Savard, Joseph Turian, David Warde-Farley

Presented on June 13th 2011


HPCS 2011, Montréal


Theano Goal

I Tries to be the holy grail in computing: easy to code and fast to execute!
I Works only on mathematical expressions
I So you won’t have:
I Function calls inside a Theano function
I Structures, enums
I Dynamic types (Theano is fully typed)
I ...

I And it doesn’t do coffee!



Faster on CPU and GPU


Project Status
Why you can rely on Theano:
I Theano has been developed and used since January 2008 (3.5 yrs old)

I Core technology for a funded Silicon-Valley startup

I Has driven over 40 research papers in the last few years

I Good user documentation

I Active mailing list with participants from outside our lab

I Many contributors (some from outside our lab)

I Used to teach IFT6266 for two years

I Used by everyone in our lab (~30 people)

I Deep Learning Tutorials

I Unofficial RPMs for Mandriva


I Downloads (June 8, 2011; since last January):
I PyPI: 780
I MLOSS: 483
I Assembla (“bleeding edge” repository): unknown

Overview 1

I Exercises as we go
I Introduction
I Why Scripting for GPUs?
I Theano vs. PyCUDA vs. PyOpenCL vs. CUDA
I Python in 1 slide
I NumPy in 1 slide
I Theano
I Introduction
I Simple example
I Real example
I Theano Flags
I GPU
I Symbolic Variables
I Differentiation Details
I Benchmarks
I break?


Overview 2

I Advanced Theano
I Compilation Pipeline
I Inplace Optimization
I Profiling
I Drawing/Printing Theano Graph
I Debugging
I Scan (For-Loop generalization)
I Known Limitations


Overview 3

I PyCUDA
I Introduction
I Example
I CUDA Overview
I Extending Theano
I Theano Graph
I Op Contract
I Op Example
I Theano + PyCUDA
I GpuNdArray
I Conclusion


Overview 4

I Only a high-level overview of CUDA


I Won’t talk about how to optimize GPU code


Why GPU

I Faster, cheaper, more efficient power usage


I How much faster? I have seen numbers from 100x slower to 1000x faster.
I It depends on the algorithms
I How the benchmark is done
I Quality of implementation
I How much time was spent optimizing CPU vs GPU code
I In Theory:
I Intel Core i7 980 XE (107Gf/s float64) 6 cores
I NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 480 cores
I NVIDIA GTX580 (1.5Tf/s float32) 512 cores
I Theano goes up to 100x faster on the GPU because we don’t use multiple
cores on the CPU
I Theano can be linked with multi-core capable BLAS (GEMM and GEMV)
I If you see 1000x, it probably means the benchmark is not fair


Why Scripting for GPUs

They Complement each other


I GPUs are everything that scripting/high level languages are not
I Highly parallel
I Very architecture-sensitive
I Built for maximum FP/memory throughput
I CPU: largely restricted to control
I Optimized for sequential code and low latency (rather than high
throughput)
I Tasks (1000/sec)
I Scripting fast enough


Theano vs PyCUDA vs PyOpenCL vs CUDA

I Theano
I Mathematical expression compiler
I Generates custom C and CUDA code
I Uses Python code when performance is not critical
I CUDA
I C extension by NVIDIA that allows coding for and using the GPU
I PyCUDA (Python + CUDA)
I Python interface to CUDA
I Memory management of GPU objects
I Compilation of code for the low-level driver
I PyOpenCL (Python + OpenCL)
I PyCUDA for OpenCL


What is your background?

Do you have experience with :


I Python
I NumPy / SciPy / Matlab
I Maple / Mathematica / SymPy
I GPU programming / CUDA / OpenCL
I Cython / Weave / Numexpr
I C / Java / Fortran


Python in 1 Slide

I Interpreted language
I General-purpose high-level programming language
I OO and scripting language
I Emphasizes code readability
I Large and comprehensive standard library
I Indentation for block delimiters
I Dynamic typing and memory management
I Dictionary: d = {'var1': 'value1', 'var2': 42, ...}
I List comprehension: [i+3 for i in range(10)]


NumPy in 1 Slide

I Base scientific computing package in Python on the CPU


I A powerful N-dimensional array object
I ndarray.{ndim, shape, size, dtype, itemsize, strides}
I Sophisticated “broadcasting” functions (see the sketch below)
I numpy.random.rand(4,5) * numpy.random.rand(1,5) ⇒ mat(4,5)
I numpy.random.rand(4,5) * numpy.random.rand(4,1) ⇒ mat(4,5)
I numpy.random.rand(4,5) * numpy.random.rand(5) ⇒ mat(4,5)
I Tools for integrating C/C++ and Fortran code
I Linear algebra, Fourier transform and pseudorandom number generation
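A minimal sketch of the broadcasting rules listed above (plain NumPy; shapes are illustrative):

import numpy

a = numpy.random.rand(4, 5)     # shape (4, 5)
row = numpy.random.rand(1, 5)   # length-1 axis broadcast over the 4 rows
col = numpy.random.rand(4, 1)   # length-1 axis broadcast over the 5 columns
v = numpy.random.rand(5)        # 1-d array treated like shape (1, 5)

print (a * row).shape, (a * col).shape, (a * v).shape  # (4, 5) (4, 5) (4, 5)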



Pointers

I Website: http://deeplearning.net/software/theano/
I Announcements mailing list:
http://groups.google.com/group/theano-announce
I User mailing list: http://groups.google.com/group/theano-users
I Deep Learning Tutorials: http://www.deeplearning.net/tutorial/

I Installation: http://deeplearning.net/software/theano/install.html


Description

I Mathematical symbolic expression compiler


I Dynamic C/CUDA code generation
I Efficient symbolic differentiation
I Theano computes derivatives of functions with one or many inputs.
I Speed and stability optimizations
I Gives the right answer for log(1 + x) even if x is really tiny.
I Works on Linux, Mac and Windows
I Transparent use of a GPU
I float32 only for now (working on other data types)
I GPU support doesn’t work on Windows for now
I On the GPU, data-intensive calculations are typically between 6.5x and 44x
faster. We’ve seen speedups up to 140x.


Description 2

I Extensive unit-testing and self-verification


I Detects and diagnoses many types of errors
I On CPU, common machine learning algorithms are 1.6x to 7.5x faster
than competitive alternatives
I including specialized implementations in C/C++, NumPy, SciPy, and
Matlab
I Expressions mimic NumPy’s syntax & semantics
I Statically typed and purely functional
I Some sparse operations (CPU only)
I The project was started by James Bergstra and Olivier Breuleux
I For the past 1-2 years, I have replaced Olivier as lead contributor


Why Theano is better

Executing the code is faster because Theano:


I Rearranges high-level expressions
I Produces customized low-level code
I Uses a variety of backend technologies (GPU,...)

Writing the code is faster because:


I A high-level language lets you concentrate on the algorithm
I Theano does automatic optimization
I No need to manually optimize for each algorithm you want to test
I Theano does automatic, efficient symbolic differentiation
I No need to manually differentiate your functions (tedious & error-prone for
complicated expressions!)


Simple Example

import theano
a = theano.tensor.vector("a") # declare symbolic variable
b = a + a**10 # build symbolic expression
f = theano.function([a], b) # compile function
print f([0,1,2]) # prints array([ 0., 2., 1026.])


Simple Example: Optimized graph

no pow, fused elemwise op!

Symbolic programming
I Paradigm shift: people need to use it to understand it


Exercises 1

source /groups/h/hpc2011/bin/GPU.csh
hg clone http://hg.assembla.com/theano Theano
cd Theano/doc/hpcs2011_tutorial
python simple_example.py

Modify and execute the example to do this expression: a**2 + b**2 + 2*a*b


A Real Example: Logistic Regression

I GPU-ready
I Symbolic differentiation
I Speed optimizations
I Stability optimizations


A Real Example: Logistic Regression

import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400
feats = 784
D = (rng.randn(N, feats), rng.randint(size=N,low=0, high=2))
training_steps = 10000


A Real Example: Logistic Regression

# Declare Theano symbolic variables


x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")  # one weight per input feature
b = theano.shared(0., name="b")
print "Initial model:"
print w.get_value(), b.get_value()


A Real Example: Logistic Regression

# Declare Theano symbolic variables


x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")

# Construct Theano expression graph


p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability that target = 1
prediction = p_1 > 0.5 # The prediction thresholded
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)# Cross-entropy loss function
cost = xent.mean() + 0.01*(w**2).sum() # The cost to minimize
gw,gb = T.grad(cost, [w,b])


A Real Example: Logistic Regression

x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")
b = theano.shared(0., name="b")
p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))
prediction = p_1 > 0.5
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)
cost = xent.mean() + 0.01*(w**2).sum()
gw,gb = T.grad(cost, [w,b])

# Compile
train = theano.function(
          inputs=[x,y],
          outputs=[prediction, xent],
          updates={w:w-0.1*gw, b:b-0.1*gb})
predict = theano.function(inputs=[x], outputs=prediction)

A Real Example: Logistic Regression

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print "Final model:"


print w.get_value(), b.get_value()
print "target values for D:", D[1]
print "prediction on D:", predict(D[0])


A Real Example: optimization

p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))


xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)
prediction = p_1 > 0.5
cost = xent.mean() + 0.01*(w**2).sum()
gw,gb = T.grad(cost, [w,b])

train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
updates={w:w-0.1*gw, b:b-0.1*gb}) # This is a dictionary
Where are these optimizations applied?
I log(1+exp(x))
I 1 / (1 + T.exp(var)) (sigmoid)
I log(1-sigmoid(var)) (softplus, stabilization)
I GEMV (matrix-vector multiply from BLAS)
I Loop fusion

A Real Example: optimization!

p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b))


# 1 / (1 + T.exp(var)) -> sigmoid(var)
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)
# Log(1-sigmoid(var)) -> -sigmoid(var)

prediction = p_1 > 0.5


cost = xent.mean() + 0.01*(w**2).sum()
gw,gb = T.grad(cost, [w,b])

train = theano.function(
inputs=[x,y],
outputs=[prediction, xent],
# w-0.1*gw: GEMV with the dot in the grad
updates={w:w-0.1*gw, b:b-0.1*gb})

I Loop fusion in many places



Theano Flags

Theano can be configured with flags. They can be defined in two ways:
I With an environment variable:
THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"
I With a configuration file that defaults to ~/.theanorc (see the check below)
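A quick, minimal way to check from Python which flag values are currently in effect (a sketch; the values shown in the comments are only examples):

import theano

# These reflect whatever was set via THEANO_FLAGS or ~/.theanorc,
# falling back to the built-in defaults.
print theano.config.mode     # e.g. FAST_RUN or ProfileMode
print theano.config.floatX   # e.g. float64
print theano.config.device   # e.g. cpu or gpu0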


Exercises 2

python logreg_example.py

Modify and execute the example in the file logreg_example.py to run on the CPU
with floatX=float32
* You will need to use: theano.config.floatX and ndarray.astype("...")


GPU

I Only 32-bit floats are supported (being worked on)


I Only 1 GPU per process
I Use the Theano flag device=gpu to tell Theano to use the GPU device
I Use device=gpu0, 1, ... to specify which GPU if you have more than one
I Shared variables with float32 dtype are by default moved to the GPU
memory space
I Use the Theano flag floatX=float32
I Be sure to use floatX (theano.config.floatX) in your code
I Cast inputs before putting them into a shared variable
I Casting "problem": int32 combined with float32 → float64
I A new casting mechanism is being developed
I Insert manual casts in your code or use [u]int8,16
I Insert a manual cast around the mean operator (which involves a division by the
length, which is an int64!); see the sketch below
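A minimal sketch of the floatX/casting advice above (variable names and shapes are illustrative):

import numpy
import theano
import theano.tensor as T

floatX = theano.config.floatX                       # 'float32' when targeting the GPU

x = T.matrix("x")                                   # dtype = floatX
data = numpy.random.rand(400, 784).astype(floatX)   # cast before putting it in a shared variable
w = theano.shared(numpy.zeros(784, dtype=floatX), name="w")

# mean() divides by an int64 length, which can upcast float32 results,
# so cast back to floatX explicitly.
cost = T.cast(T.dot(x, w).mean(), floatX)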


GPU for Exercises

I Intel Core i7 980 XE (107Gf/s float64, 1050$, 6 cores/12 threads)


I NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32, 2400$, 480 cores),
compute capability 2.0
I NVIDIA GTX580 (1.5Tf/s float32, 500$, 512 cores), compute capability
2.0
Computers in the class
I Intel Xeon X3450 (?56? flops/s, 383$, 4 cores)
I NVIDIA Quadro FX 580 (71GF/s single, 140$, 32 cores), compute
capability 1.1, 'professional card'


Exercises 3

I Modify and execute the code to run with floatX=float32 on GPU


I Time with: time python file.py


Creating symbolic variables

I Dimensions
I T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4
I Dtype
I T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16,
int32, int64)
I T.vector → floatX dtype
I floatX: configurable dtype that can be float32 or float64.
I Custom variables
I All are shortcuts to: T.tensor(dtype, broadcastable=[False]*nd)
I Other dtypes: uint[8,16,32,64], floatX (see the sketch below)
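A minimal sketch of these constructors (names and shapes are illustrative; the last line mirrors the general form quoted above):

import theano.tensor as T

s = T.scalar("s")        # 0-d, dtype = floatX
m = T.matrix("m")        # 2-d, dtype = floatX
v = T.fvector("v")       # explicit float32 vector
# The general form that the shortcuts expand to:
t3 = T.tensor(dtype="int64", broadcastable=[False] * 3)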


Creating symbolic variables: Broadcastability

I Remember what I said about broadcasting?


I How to add a row to all rows of a matrix?
I How to add a column to all columns of a matrix?

I Broadcastability must be specified when creating the variable


I The only shortcuts with broadcastable dimensions are: T.row and T.col
I For all others: T.tensor(dtype, broadcastable=([False or True])*nd) (see the sketch below)
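A minimal sketch answering the row/column questions above (shapes are illustrative):

import numpy
import theano
import theano.tensor as T

m = T.matrix("m")        # broadcastable = (False, False)
r = T.row("r")           # broadcastable = (True, False): added to every row
c = T.col("c")           # broadcastable = (False, True): added to every column

f = theano.function([m, r, c], [m + r, m + c])
floatX = theano.config.floatX
out_r, out_c = f(numpy.zeros((3, 4), dtype=floatX),
                 numpy.ones((1, 4), dtype=floatX),
                 numpy.ones((3, 1), dtype=floatX))
print out_r.shape, out_c.shape   # (3, 4) (3, 4)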


Differentiation Details

gw,gb = T.grad(cost, [w,b])

I T.grad works symbolically: it takes and returns Theano variables


I T.grad can be compared to a macro: it can be applied multiple times
I T.grad takes scalar costs only (see the sketch below)
I A simple recipe allows computing vector × Jacobian and vector
× Hessian efficiently
I We are working on the missing optimizations to be able to compute
the full Jacobian and Hessian, and Jacobian × vector, efficiently
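A minimal sketch of these properties (the cost and values are illustrative):

import theano
import theano.tensor as T

x = T.vector("x")
cost = (x ** 2).sum()            # a scalar cost
g = T.grad(cost, x)              # symbolic gradient: another Theano variable
g2 = T.grad(g.sum(), x)          # grad can be applied again, like a macro

f = theano.function([x], [g, g2])
print f([1.0, 2.0, 3.0])         # [array([ 2., 4., 6.]), array([ 2., 2., 2.])]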


Benchmarks

Example:
I Multi-layer perceptron
I Convolutional Neural Networks
I Misc Elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
I EBLearn, Torch5: specialized libraries written by practitioners specifically
for these tasks
I numexpr: similar to Theano, ’virtual machine’ for elemwise expressions


Benchmark MLP
Multi-Layer Perceptron: 60x784 matrix times 784x500 matrix, tanh, times
500x10 matrix, elemwise, then all in reverse for backpropagation


Benchmark Convolutional Network


Convolutional Network: 256x256 images convolved with 6 7x7 filters,
downsampled to 6x50x50, tanh, convolution with 16 6x7x7 filters, elementwise
tanh, matrix multiply, softmax elementwise, then in reverse


Elemwise Benchmark
I All on CPU
I Solid blue: Theano
I Dashed Red: numexpr (without MKL)

[Plots: speedup vs NumPy of Theano (solid blue) and numexpr (dashed red) for the expressions a**2 + b**2 + 2*a*b, 2*a + 3*b, a + 1, and 2*a + b**10, as a function of the dimension of vectors a and b (1e3 to 1e7).]

Compilation Pipeline

Canonicalization

Stabilization

Specialization

GPU Transfer

Elemwise Fusion

Inplace

Code Generation


Inplace Optimization

I 2 types of inplace operations:


I An op that returns a view of its inputs (e.g. reshape, inplace transpose)
I An op that writes its output into the memory space of an input
I This allows some memory optimizations
I An Op must tell Theano if it works inplace (see the sketch below)
I Inplace Ops add constraints on the order of execution
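As a sketch of how an Op declares this to Theano: the view_map/destroy_map attribute names are the Theano convention, but the Ops below are only illustrative stubs (no make_node/perform).

import theano

class TransposeView(theano.Op):
    """Illustrative stub: output 0 is a view of input 0 (no copy is made)."""
    view_map = {0: [0]}

class AddInplace(theano.Op):
    """Illustrative stub: output 0 overwrites the memory of input 0."""
    destroy_map = {0: [0]}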


Profile Mode

To replace the default mode with this mode, use the Theano flag
mode=ProfileMode
To enable memory profiling, use the flag
ProfileMode.profile_memory=True
Time since import 33.456s
Theano compile time: 1.023s (3.1% since import)
Optimization time: 0.789s
Linker time: 0.221s
Theano fct call 30.878s (92.3% since import)
Theano Op time 29.411s 87.9%(since import) 95.3%(of fct call)
Theano function overhead in ProfileMode 1.466s 4.4%(since import)
4.7%(of fct call)
10001 Theano fct call, 0.003s per call
Rest of the time since import 1.555s 4.6%


Profile Mode: Function Summary

Theano outputs:

Theano fct summary:


<% total fct time> <total time> <time per call> <nb call> <fct name>
100.0% 30.877s 3.09e-03s 10000 train
0.0% 0.000s 4.06e-04s 1 predict


Profile Mode: Single Op-Wise Summary

Theano outputs:

Single Op-wise summary:


<% of local_time spent on this kind of Op> <cumulative %>
<self seconds> <cumulative seconds> <time per call> <nb_call>
<nb_op> <nb_apply> <Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 1 1 <Gemv>
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10001 1 2 <Dot>
2.4% 99.3% 0.691s 29.206s 7.68e-06s * 90001 10 10 <Elemwise>
0.4% 99.7% 0.127s 29.334s 1.27e-05s 10000 1 1 <Alloc>
0.2% 99.9% 0.053s 29.386s 1.75e-06s * 30001 2 4 <DimShuffle>
0.0% 100.0% 0.014s 29.400s 1.40e-06s * 10000 1 1 <Sum>
0.0% 100.0% 0.011s 29.411s 1.10e-06s * 10000 1 1 <Shape_i>
(*) Op is running a c implementation


Profile Mode: Op-Wise Summary

Theano outputs:

Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %>
<self seconds> <cumulative seconds> <time per call>
<nb_call> <nb apply> <Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 1 Gemv{inplace}
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10001 2 dot
1.3% 98.2% 0.378s 28.893s 3.78e-05s * 10000 1 Elemwise{Composi
scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}
0.4% 98.7% 0.127s 29.021s 1.27e-05s 10000 1 Alloc
0.3% 99.0% 0.092s 29.112s 9.16e-06s * 10000 1 Elemwise{Composi
exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)]
0.1% 99.3% 0.033s 29.265s 1.66e-06s * 20001 3 InplaceDimShuffl
... (remaining 11 Apply account for 0.7%(0.00s) of the runtime)
(*) Op is running a c implementation


Profile Mode: Apply-Wise Summary

Theano outputs:

Apply-wise summary:
<% of local_time spent at this position> <cumulative %%>
<apply time> <cumulative seconds> <time per call>
<nb_call> <Apply position> <Apply Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 15 Gemv{inplace}(
w, TensorConstant{-0.01}, InplaceDimShuffle{1,0}.0, Elemwise{Com
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10000 1 dot(x, w)
1.3% 98.2% 0.378s 28.893s 3.78e-05s 10000 9 Elemwise{Composit
0.4% 98.7% 0.127s 29.020s 1.27e-05s 10000 10 Alloc(Elemwise{in
0.3% 99.0% 0.092s 29.112s 9.16e-06s 10000 13 Elemwise{Composit
0.3% 99.3% 0.080s 29.192s 7.99e-06s 10000 11 Elemwise{ScalarSi
... (remaining 14 Apply instances account for
0.7%(0.00s) of the runtime)


Profile Mode: Memory Profile


Theano outputs:
Profile of Theano functions memory:
(This check only the output of each apply node. It don’t check the
temporary memory used by the op in the apply node.)
Theano fct: train
Max without gc, inplace and view (KB) 2481
Max FAST_RUN_NO_GC (KB) 16
Max FAST_RUN (KB) 16
Memory saved by view (KB) 2450
Memory saved by inplace (KB) 15
Memory saved by GC (KB) 0
<Sum apply outputs (bytes)> <Apply outputs memory size(bytes)>
<created/inplace/view> <Apply node>
<created/inplace/view> is taked from the op declaration, not ...
2508800B [2508800] v InplaceDimShuffle{1,0}(x)
6272B [6272] i Gemv{inplace}(w, ...)
3200B [3200] c Elemwise{Composite{...}}(y, ...)

Profile Mode: Tips

Theano outputs:

Here are tips to potentially make your code run faster


(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32


Exercises 4

I In the last exercises, do you see a speed up with the GPU?


I Where does it come from? (Use ProfileMode)
I Is there something we can do to speed up the GPU version?


Text Printing of Your Theano Graph: Pretty Printing

theano.printing.pprint(variable)

>>> theano.printing.pprint(prediction)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \\dot w)) - b))))
TensorConstant{0.5})


Text Printing of Your Theano Graph: Debug Print


theano.printing.debugprint(fct, variable, list of variables)
>>> theano.printing.debugprint(prediction)
Elemwise{gt,no_inplace} [@181772236] ’’
|Elemwise{true_div,no_inplace} [@181746668] ’’
| |InplaceDimShuffle{x} [@181746412] ’’
| | |TensorConstant{1} [@181745836]
| |Elemwise{add,no_inplace} [@181745644] ’’
| | |InplaceDimShuffle{x} [@181745420] ’’
| | | |TensorConstant{1} [@181744844]
| | |Elemwise{exp,no_inplace} [@181744652] ’’
| | | |Elemwise{sub,no_inplace} [@181744012] ’’
| | | | |Elemwise{neg,no_inplace} [@181730764] ’’
| | | | | |dot [@181729676] ’’
| | | | | | |x [@181563948]
| | | | | | |w [@181729964]
| | | | |InplaceDimShuffle{x} [@181743788] ’’
| | | | | |b [@181730156]
|InplaceDimShuffle{x} [@181771788] ’’
| |TensorConstant{0.5} [@181771148]

Text Printing of Your Theano Graph: Debug Print

theano.printing.debugprint(fct, variable, list of variables)

>>> theano.printing.debugprint(predict)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] ’’ 2
|dot [@183018796] ’’ 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] ’’ 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]


Picture Printing of Graphs

>>> theano.printing.pydotprint_variables(prediction)


Picture Printing of Graphs

All pydotprint* functions require graphviz and pydot


>>> theano.printing.pydotprint(predict)


Picture Printing of Graphs

>>> theano.printing.pydotprint(train) # This is a small train example!


How to Debug

I Run with the flag mode=DebugMode


I 100-1000x slower
I Tests all optimization steps from the original graph to the final graph
I Checks many things that an Op should/shouldn’t do
I Executes both the Python and C code versions
I Run with the Theano flag compute_test_value = "off",
"ignore", "warn", "raise"
I Runs the code as you create the graph
I Allows you to find the bug earlier (e.g. a shape mismatch)
I Makes it easier to identify where the problem is in your code
I Uses the values of constants and shared variables directly
I For pure symbolic variables, use x.tag.test_value =
numpy.random.rand(5,10) (see the sketch below)
I Run with the flag mode=FAST_COMPILE
I Few optimizations
I Runs Python code (better error messages and can be debugged
interactively in the Python debugger)
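A minimal sketch of the test-value mechanism described above (shapes are illustrative):

import numpy
import theano
import theano.tensor as T

theano.config.compute_test_value = 'raise'   # evaluate test values while building the graph

x = T.matrix("x")
w = T.matrix("w")
x.tag.test_value = numpy.random.rand(5, 10)
w.tag.test_value = numpy.random.rand(10, 2)

y = T.dot(x, w)   # computed on the test values right away;
                  # a shape mismatch would raise here, at graph-building time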


Scan
I General form of recurrence, which can be used for looping.
I Reduction and map (loop over the leading dimensions) are special cases of
Scan
I You *scan* a function along some input sequence, producing an output at
each time-step
I The function can see the previous K time-steps of your function
I "sum()" could be computed by scanning the z + x_i function over a list,
given an initial state of "z=0".
I Often a for-loop can be expressed as a "scan()" operation, and "scan" is
the closest that Theano comes to looping.
I The advantages of using "scan" over for loops:
I The number of iterations can be part of the symbolic graph
I Minimizes GPU transfers if a GPU is involved
I Computes gradients through sequential steps
I Slightly faster than using a for loop in Python with a compiled Theano
function
I Can lower the overall memory usage by detecting the actual
amount of memory needed

Scan Example: Computing pow(A,k)

k = T.iscalar("k"); A = T.vector("A")

def inner_fct(prior_result, A): return prior_result * A


# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
outputs_info=T.ones_like(A),
non_sequences=A, n_steps=k)

# Scan has provided us with A**1 through A**k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]

power = theano.function(inputs=[A,k], outputs=final_result,
                        updates=updates)

print power(range(10),2)
# [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]

Scan Example: Calculating a Polynomial

coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x"); max_coefficients_supported = 10000

# Generate the components of the polynomial


full_range=theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(fn=lambda coeff, power, free_var:
coeff * (free_var ** power),
outputs_info=None,
sequences=[coefficients, full_range],
non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function(inputs=[coefficients, x],
outputs=polynomial)

test_coeff = numpy.asarray([1, 0, 2], dtype=numpy.float32)

print calculate_polynomial(test_coeff, 3)
# 19.0

Exercises 5

I Run the examples in the files scan_pow.py and scan_poly.py
I Modify and execute the polynomial example to have the reduction done by
scan


Known Limitations

I Compilation phase distinct from execution phase


I Compilation time can be significant
I Amortize it with functions over big inputs or by reusing functions
I Execution overhead
I Needs a certain number of operations to be useful
I We have started working on this in a branch
I Compilation time is superlinear in the size of the graph
I A few hundred nodes is fine
I Disabling a few optimizations can speed up compilation
I Usually too many nodes indicates a problem with the graph
I Lazy evaluation in a branch (we will try to merge it this summer)


PyCUDA


Intro
Author: Andreas Klöckner
I PyCUDA can access NVIDIA’s CUDA parallel computation API from Python
I Object cleanup tied to the lifetime of objects (RAII, Resource Acquisition Is
Initialization)
I Makes it much easier to write correct, leak- and crash-free code
I PyCUDA knows about dependencies (e.g. it won’t detach from a context
before all memory allocated in it is also freed)
I Convenience
I Abstractions to compile CUDA code from Python:
pycuda.driver.SourceModule
I A GPU memory buffer: pycuda.gpuarray.GPUArray (see the sketch below)
I Completeness
I Binding to all of CUDA’s driver API
I Automatic error checking
I All CUDA errors are automatically translated into Python exceptions
I Speed
I PyCUDA’s base layer is written in C++
I Helpful documentation
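A minimal sketch of the GPUArray convenience layer mentioned above (array size is illustrative):

import pycuda.autoinit                  # initializes a CUDA context
import pycuda.gpuarray as gpuarray
import numpy

a = numpy.random.randn(400).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)              # copy the host array to GPU memory
b = (2 * a_gpu).get()                   # elementwise op on the GPU, then copy back
assert numpy.allclose(b, 2 * a)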

Example

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule


mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")


Example

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))


GPU Programming: Gains and Losses

I Gains:
I Memory bandwidth (140 GB/s vs 12 GB/s)
I Compute bandwidth (peak: 1 TF/s vs 0.1 TF/s in float)
I Data-parallel programming
I Losses:
I No performance portability guarantee
I Data size influences the implementation code more on the GPU
I Cheap branches
I Fine-grained malloc/free*
I Recursion*
I Function pointers*
I IEEE 754 FP compliance*
* Less problematic with new hardware (NVIDIA Fermi)
[slide from Andreas Klöckner]


CPU vs GPU Architecture

Source: NVIDIA CUDA C Programming Guide


Different GPU Block Repartition

Source: NVIDIA CUDA C Programming Guide



GPU thread structure

Source: NVIDIA CUDA C Programming Guide



Exercises 6

I Run the example in the file pycuda_simple.py


I Modify and execute it to work on a 20 × 10 matrix


Theano Graph
I Theano works with symbolic graphs
I Those graphs are bipartite graphs (graphs with 2 types of nodes)
I Those 2 node types are Apply and Variable nodes
I Inputs and outputs are lists of Theano variables

[Diagram: an Apply node links its Op to its input Variables and its output Variables]
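A minimal sketch of how this graph structure can be inspected from Python (the expression is illustrative):

import theano.tensor as T

x = T.vector("x")
y = x * 2                     # building the expression creates an Apply node

apply_node = y.owner          # the Apply node that produced y
print apply_node.op           # the Op (an elementwise multiplication)
print apply_node.inputs       # list of input Variables
print apply_node.outputs      # list of output Variables (contains y)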


Op Contract

class MyOp(Op):
    def __eq__(self, other):
    def __hash__(self):
    def __str__(self):
    def make_node(self, *inputs):

    # Python implementation:
    def perform(self, node, inputs_storage, outputs_storage):
    # C implementation: [see theano web site]
    # other implementations (pycuda, ...):
    def make_thunk(self, node, storage_map, _, _2):

    # optional:
    def __init__(self, ...):
    def grad(self, inputs, g):
    def infer_shape(node, (i0_shapes, ...)):

Op Example

import theano

class DoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)
    def __hash__(self):
        return hash(type(self))
    def __str__(self):
        return self.__class__.__name__
    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])
    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2

Theano Op Example: Test it!

x = theano.tensor.matrix()
f = theano.function([x],DoubleOp()(x))

import numpy
inp = numpy.random.rand(5,5)
out = f(inp)
assert numpy.allclose(inp*2, out)
print inp
print out


Exercises 7

I Run the code in the file double_op.py.


I Modify and execute it to compute: x * y
I Modify and execute the example to return 2 outputs: x + y and x - y
I Our current elemwise fusion generates computations with only 1 output


Theano+PyCUDA Op Example

import numpy, theano


import theano.misc.pycuda_init
from pycuda.compiler import SourceModule
import theano.sandbox.cuda as cuda

class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)
    def __hash__(self):
        return hash(type(self))
    def __str__(self):
        return self.__class__.__name__
    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])

Theano + PyCUDA Op Example: make thunk

def make_thunk(self, node, storage_map, _, _2):
    mod = SourceModule(THE_C_CODE)
    pycuda_fct = mod.get_function("my_fct")
    inputs = [storage_map[v] for v in node.inputs]
    outputs = [storage_map[v] for v in node.outputs]
    def thunk():
        z = outputs[0]
        if z[0] is None or z[0].shape != inputs[0][0].shape:
            z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
        grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
        pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                   block=(512, 1, 1), grid=grid)
    return thunk


Theano + PyCUDA Op Example: GPU Code

THE_C_CODE = """
__global__ void my_fct(float * i0, float * o0, int size) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<size){
o0[i] = i0[i]*2;
}
}""")


Theano + PyCUDA Op Example: Test it!

x = theano.tensor.fmatrix()
f = theano.function([x], PyCUDADoubleOp()(x))
xv=numpy.ones((4,5), dtype="float32")

assert numpy.allclose(f(xv), xv*2)


print numpy.asarray(f(xv))


Exercises 8

I Run the example in the file pycuda_double_op.py


I Modify and execute the example to multiply two matrices: x * y
I Modify and execute the example to return 2 outputs: x + y and x - y
I Our current elemwise fusion generates computations with only 1 output
I Modify and execute the example to support strides (don't force the input
to be C contiguous)


Why a common GPU ndarray?

I Currently there are at least 4 different GPU array data structures in use by
Python packages
I CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat),
GPUArray (PyOpenCL), ...
I There are even more if we include other languages
I All of them are a subset of the functionality of numpy.ndarray on the
GPU
I Lots of duplicated effort
I GPU code is harder/slower to get correct and fast than CPU/Python
code
I Lack of a common array API makes it harder to port/reuse code
I Also harder to find/distribute code
I Divides development work


Design Goals

I Make it VERY similar to numpy.ndarray


I Be compatible with both CUDA and OpenCL
I Have the base object accessible from C to allow collaboration with more
projects, across high-level languages
I We want people from C, C++, Ruby, R, ... to all use the same base GPU
N-dimensional array


Final GpuNdArray Note

I Under development
I Will be the next GPU array container for Theano (this summer!)
I Probably also for PyCUDA, PyOpenCL
I Mailing list: http://lists.tiker.net/listinfo/gpundarray


Conclusion

I I presented a tool that tries to be the holy grail in computing: easy to
code and fast to execute!
I Generates fast, custom CPU code and GPU code
I You can easily wrap existing CPU/GPU code with Theano
I It works and is used in the real world by academic researchers and
industry


Thanks

I Thanks for attending this tutorial

I Thanks to the agencies that provide resources for this project: Calcul Québec,
CIFAR, Compute Canada, FQRNT, MITACS, NSERC, SciNet,
SHARCNET, Ubisoft and WestGrid.


Questions/Comments?
