Tutorial hpcs2011 Fixed
Introduction
Theano
Advanced Theano
PyCUDA
CUDA
Extending Theano
GpuNdArray
Conclusion
Theano Goal
- Tries to be the holy grail of computing: easy to code and fast to execute!
- Works only on mathematical expressions
- So you won't have:
  - Function calls inside a Theano function
  - Structures or enums
  - Dynamic typing (Theano is fully typed)
  - ...
Project Status
Why you can rely on Theano:
- Theano has been developed and used since January 2008 (3.5 years old)
Overview 1
- Exercises as we go
- Introduction
  - Why Scripting for GPUs?
  - Theano vs. PyCUDA vs. PyOpenCL vs. CUDA
  - Python in 1 slide
  - NumPy in 1 slide
- Theano
  - Introduction
  - Simple example
  - Real example
  - Theano Flags
  - GPU
  - Symbolic Variables
  - Differentiation Details
  - Benchmarks
  - Break?
Overview 2
- Advanced Theano
  - Compilation Pipeline
  - Inplace Optimization
  - Profiling
  - Drawing/Printing a Theano Graph
  - Debugging
  - Scan (for-loop generalization)
  - Known Limitations
Overview 3
- PyCUDA
  - Introduction
  - Example
- CUDA Overview
- Extending Theano
  - Theano Graph
  - Op Contract
  - Op Example
  - Theano + PyCUDA
- GpuNdArray
- Conclusion
Overview 4
Why GPU
Theano vs. PyCUDA vs. PyOpenCL vs. CUDA
- Theano
  - Mathematical expression compiler
  - Generates custom C and CUDA code
  - Uses Python code when performance is not critical
- CUDA
  - C extension by NVIDIA that lets you write code for, and run it on, the GPU
- PyCUDA (Python + CUDA)
  - Python interface to CUDA
  - Memory management of GPU objects
  - Compilation of code for the low-level driver
- PyOpenCL (Python + OpenCL)
  - PyCUDA for OpenCL
Python in 1 Slide
- Interpreted language
- General-purpose, high-level programming language
- Object-oriented and scripting language
- Emphasizes code readability
- Large and comprehensive standard library
- Indentation for block delimiters
- Dynamic typing and memory management
- Dictionaries: d = {'var1': 'value1', 'var2': 42, ...}
- List comprehensions: [i + 3 for i in range(10)]
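A minimal sketch of those last two idioms (values are illustrative):
d = {'var1': 'value1', 'var2': 42}   # dictionary literal
lst = [i + 3 for i in range(10)]     # list comprehension
print(d['var2'], lst[:3])            # 42 [3, 4, 5]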
NumPy in 1 Slide
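A minimal sketch of NumPy ndarray basics (illustrative, since the original slide body is missing):
import numpy
a = numpy.ones((3, 4))        # 3x4 array of float64
b = a * 2 + 1                 # elementwise arithmetic
c = numpy.dot(a, b.T)         # matrix product, shape (3, 3)
print(a.shape, a.dtype, c.sum())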
Pointers
- Website: http://deeplearning.net/software/theano/
- Announcements mailing list: http://groups.google.com/group/theano-announce
- User mailing list: http://groups.google.com/group/theano-users
- Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
- Installation: https://deeplearning.net/software/theano/install.html
Description
Description 2
Simple Example
import theano
a = theano.tensor.vector("a")  # declare symbolic variable
b = a + a**10                  # build symbolic expression
f = theano.function([a], b)    # compile function
print(f([0, 1, 2]))            # prints array([ 0., 2., 1026.])
Symbolic programming
- Paradigm shift: people need to use it to understand it
Exercises 1
source /groups/h/hpc2011/bin/GPU.csh
hg clone http://hg.assembla.com/theano Theano
cd Theano/doc/hpcs2011_tutorial
python simple_example.py
Modify and execute the example so it computes the expression a**2 + b**2 + 2*a*b.
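A possible solution, as a hedged sketch (variable names are our own):
import theano
import theano.tensor as T
a = T.vector("a")
b = T.vector("b")
out = a**2 + b**2 + 2*a*b          # elementwise (a + b)**2
f = theano.function([a, b], out)
print(f([1, 2], [3, 4]))           # [ 16. 36.]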
- GPU-ready
- Symbolic differentiation
- Speed optimizations
- Stability optimizations
import numpy
import theano
import theano.tensor as T
rng = numpy.random

N = 400      # number of training examples
feats = 784  # number of input features
# Random training data: real-valued inputs, binary labels
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000
x = T.matrix("x")
y = T.vector("y")
w = theano.shared(rng.randn(feats), name="w")  # one weight per feature
b = theano.shared(0., name="b")
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))        # probability that target = 1
prediction = p_1 > 0.5                         # prediction threshold
xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1)      # cross-entropy loss
cost = xent.mean() + 0.01*(w**2).sum()         # cost with L2 penalty
gw, gb = T.grad(cost, [w, b])                  # gradients of the cost
# Compile
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w-0.1*gw, b: b-0.1*gb})
predict = theano.function(inputs=[x], outputs=prediction)
# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])
print(predict(D[0]))   # predictions on the training inputs after training
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    updates={w: w-0.1*gw, b: b-0.1*gb})  # updates is a dictionary
Where are those optimizations applied?
- log(1 + exp(x)) (softplus)
- 1 / (1 + T.exp(var)) (sigmoid)
- log(1 - sigmoid(var)) (softplus, stabilization)
- GEMV (matrix-vector multiply from BLAS)
- Loop fusion
train = theano.function(
    inputs=[x, y],
    outputs=[prediction, xent],
    # w - 0.1*gw: GEMV with the dot in the grad
    updates={w: w-0.1*gw, b: b-0.1*gb})
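One way to check where these rewrites landed is to print the compiled graph; a minimal sketch (output abbreviated here):
theano.printing.debugprint(train)   # shows Gemv and fused Elemwise nodes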
Theano Flags
Theano can be configured with flags. They can be defined in two ways:
- With an environment variable:
  THEANO_FLAGS="mode=ProfileMode,ProfileMode.profile_memory=True"
- With a configuration file that defaults to ~/.theanorc
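For example, a hedged sketch of a ~/.theanorc (values are illustrative):
[global]
floatX = float32
mode = ProfileMode

[ProfileMode]
profile_memory = True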
Exercises 2
python logreg_example.py
Modify and execute the example in the file logreg_example.py to run on the CPU
with floatX=float32.
* You will need: theano.config.floatX and ndarray.astype(...)
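A hedged sketch of the hint (array name is our own):
import numpy, theano
data = numpy.random.randn(400, 784)
data32 = data.astype(theano.config.floatX)   # cast to the configured dtype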
GPU
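A minimal sketch of selecting the GPU with the CUDA backend of this era; the flags must be set before theano is imported:
import os
os.environ["THEANO_FLAGS"] = "device=gpu,floatX=float32"
import theano                  # initializes the GPU backend
print(theano.config.device)    # 'gpu' if initialization succeeded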
Exercises 3
Symbolic Variables
- Number of dimensions
  - T.scalar, T.vector, T.matrix, T.tensor3, T.tensor4
- Dtype
  - T.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)
  - T.vector → floatX dtype
  - floatX: configurable dtype that can be float32 or float64
- Custom variables
  - All are shortcuts for T.tensor(dtype, broadcastable=[False]*nd)
  - Other dtypes: uint[8,16,32,64], floatX
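A short sketch of constructing these variables (names are our own):
import theano.tensor as T
x = T.fvector('x')                      # float32 vector
m = T.dmatrix('m')                      # float64 matrix
v = T.vector('v')                       # dtype is floatX
t = T.tensor(dtype='int32',             # custom 3-d variable
             broadcastable=[False] * 3)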
Differentiation Details
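A minimal sketch of T.grad, the entry point for symbolic differentiation:
import theano
import theano.tensor as T
x = T.dscalar('x')
y = x ** 2
gy = T.grad(y, x)              # symbolic graph for dy/dx = 2*x
f = theano.function([x], gy)
print(f(4.0))                  # 8.0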
Benchmarks
Examples:
- Multi-layer perceptron
- Convolutional neural networks
- Misc elemwise operations
Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr
- EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
- numexpr: similar to Theano, a 'virtual machine' for elemwise expressions
Benchmark MLP
Multi-layer perceptron: a 60x784 matrix times a 784x500 matrix, tanh, times a
500x10 matrix, elemwise operations, then all of it in reverse for backpropagation.
Elemwise Benchmark
- All on CPU
- Solid blue: Theano
- Dashed red: numexpr (without MKL)
[Figure: speedup vs. NumPy for the elemwise expressions a+1 and 2*a + b**10, plotted against the dimension of vectors a and b (1e3 to 1e7).]
Compilation Pipeline
Canonicalization → Stabilization → Specialization → GPU Transfer → Elemwise Fusion → Inplace → Code Generation
Inplace Optimization
Profile Mode
To replace the default mode with this mode, use the Theano flag
mode=ProfileMode
To enable memory profiling, use the flag
ProfileMode.profile_memory=True
Time since import 33.456s
Theano compile time: 1.023s (3.1% since import)
Optimization time: 0.789s
Linker time: 0.221s
Theano fct call 30.878s (92.3% since import)
Theano Op time 29.411s 87.9%(since import) 95.3%(of fct call)
Theano function overhead in ProfileMode 1.466s 4.4%(since import)
4.7%(of fct call)
10001 Theano fct call, 0.003s per call
Rest of the time since import 1.555s 4.6%
Theano outputs:
Op-wise summary:
<% of local_time spent on this kind of Op> <cumulative %>
<self seconds> <cumulative seconds> <time per call>
<nb_call> <nb apply> <Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 1 Gemv{inplace}
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10001 2 dot
1.3% 98.2% 0.378s 28.893s 3.78e-05s * 10000 1 Elemwise{Composi
scalar_softplus,{mul,scalar_softplus,{neg,mul,sub}}}}
0.4% 98.7% 0.127s 29.021s 1.27e-05s 10000 1 Alloc
0.3% 99.0% 0.092s 29.112s 9.16e-06s * 10000 1 Elemwise{Composi
exp,{mul,{true_div,neg,{add,mul}}}}}[(0, 0)]
0.1% 99.3% 0.033s 29.265s 1.66e-06s * 20001 3 InplaceDimShuffl
... (remaining 11 Apply account for 0.7%(0.00s) of the runtime)
(*) Op is running a c implementation
Theano outputs:
Apply-wise summary:
<% of local_time spent at this position> <cumulative %>
<apply time> <cumulative seconds> <time per call>
<nb_call> <Apply position> <Apply Op name>
87.3% 87.3% 25.672s 25.672s 2.57e-03s 10000 15 Gemv{inplace}(
w, TensorConstant{-0.01}, InplaceDimShuffle{1,0}.0, Elemwise{Com
9.7% 97.0% 2.843s 28.515s 2.84e-04s 10000 1 dot(x, w)
1.3% 98.2% 0.378s 28.893s 3.78e-05s 10000 9 Elemwise{Composit
0.4% 98.7% 0.127s 29.020s 1.27e-05s 10000 10 Alloc(Elemwise{in
0.3% 99.0% 0.092s 29.112s 9.16e-06s 10000 13 Elemwise{Composit
0.3% 99.3% 0.080s 29.192s 7.99e-06s 10000 11 Elemwise{ScalarSi
... (remaining 14 Apply instances account for
0.7%(0.00s) of the runtime)
Exercises 4
theano.printing.pprint(variable)
>>> theano.printing.pprint(prediction)
gt((TensorConstant{1} / (TensorConstant{1} + exp(((-(x \dot w)) - b)))),
TensorConstant{0.5})
>>> theano.printing.debugprint(predict)
Elemwise{Composite{neg,{sub,{{scalar_sigmoid,GT},neg}}}} [@183160204] ’’ 2
|dot [@183018796] ’’ 1
| |x [@183000780]
| |w [@183000812]
|InplaceDimShuffle{x} [@183133580] ’’ 0
| |b [@183000876]
|TensorConstant{[ 0.5]} [@183084108]
>>> theano.printing.pydotprint_variables(prediction)
How to Debug
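A hedged sketch of two common tools, the Print Op and DebugMode:
import theano
import theano.tensor as T
x = T.dvector('x')
x_printed = theano.printing.Print('x is')(x)   # prints the value at runtime
f = theano.function([x], x_printed * 2, mode='DebugMode')
f([1., 2.])   # DebugMode self-checks each Op; Print shows: x is [ 1. 2.]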
Scan
- General form of recurrence, which can be used for looping
- Reduction and map (loops over the leading dimensions) are special cases of scan
- You *scan* a function along some input sequence, producing an output at each time-step
- The function can see the previous K time-steps of your function
- sum() could be computed by scanning the z + x(i) function over a list, given an initial state of z=0 (see the sketch after this list)
- Often a for-loop can be expressed as a scan() operation, and scan is the closest that Theano comes to looping
- Advantages of scan over for-loops:
  - The number of iterations can be part of the symbolic graph
  - Minimizes GPU transfers if a GPU is involved
  - Computes gradients through sequential steps
  - Slightly faster than using a for-loop in Python with a compiled Theano function
  - Can lower the overall memory usage by detecting the actual amount of memory needed
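The sum-by-scan item above, as a hedged sketch:
import numpy
import theano
import theano.tensor as T
x = T.dvector('x')
# z starts at 0; each step computes z + x[i]
results, updates = theano.scan(fn=lambda xi, z: z + xi,
                               sequences=x,
                               outputs_info=T.as_tensor_variable(numpy.float64(0)))
f = theano.function([x], results[-1], updates=updates)
print(f([1., 2., 3.]))   # 6.0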
k = T.iscalar("k"); A = T.vector("A")
# Repeatedly multiply by A to compute A**1 ... A**k
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A, n_steps=k)
# Scan has provided us with A**1 through A**k. Keep only the last
# value. Scan notices this and does not waste memory saving them.
final_result = result[-1]
power = theano.function(inputs=[A, k], outputs=final_result,
                        updates=updates)
print(power(range(10), 2))
# [ 0. 1. 4. 9. 16. 25. 36. 49. 64. 81.]
coefficients = theano.tensor.vector("coefficients")
x = T.scalar("x"); max_coefficients_supported = 10000
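The rest of this polynomial example is missing here; a hedged reconstruction following the scan API (test values are our own):
import numpy
# One term per (coefficient, power) pair; scan stops at the shortest sequence
full_range = theano.tensor.arange(max_coefficients_supported)
components, updates = theano.scan(
    fn=lambda coeff, power, free_var: coeff * (free_var ** power),
    sequences=[coefficients, full_range],
    non_sequences=x)
polynomial = components.sum()
calculate_polynomial = theano.function([coefficients, x], polynomial)
test_coeffs = numpy.asarray([1, 0, 2], dtype=theano.config.floatX)
print(calculate_polynomial(test_coeffs, 3))   # 1 + 0*3 + 2*9 = 19.0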
Exercises 5
- Run the examples in the files scan_pow.py and scan_poly.py
- Modify and execute the polynomial example to have the reduction also done by scan
Known Limitations
PyCUDA
Intro
Author: Andreas Klöckner
- PyCUDA gives access to Nvidia's CUDA parallel computation API from Python
- Object cleanup is tied to the lifetime of objects (RAII, Resource Acquisition Is Initialization)
  - Makes it much easier to write correct, leak- and crash-free code
  - PyCUDA knows about dependencies (e.g., it won't detach from a context before all memory allocated in it is also freed)
- Convenience
  - Abstractions to compile CUDA code from Python: pycuda.driver.SourceModule
  - A GPU memory buffer: pycuda.gpuarray.GPUArray
- Completeness
  - Bindings to all of CUDA's driver API
- Automatic error checking
  - All CUDA errors are automatically translated into Python exceptions
- Speed
  - PyCUDA's base layer is written in C++
- Helpful documentation
Example
import pycuda.autoinit
import pycuda.driver as drv
import numpy
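The kernel definition is missing from this slide; a sketch following PyCUDA's canonical example (the next slide's mod.get_function assumes it):
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")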
Example
multiply_them = mod.get_function("multiply_them")
a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))
- Gains:
  - Memory bandwidth (140 GB/s vs. 12 GB/s)
  - Compute bandwidth (peak: 1 TF/s vs. 0.1 TF/s in float)
  - Data-parallel programming
- Losses:
  - No performance-portability guarantee
  - Data size influences the GPU implementation code more
  - Cheap branches
  - Fine-grained malloc/free*
  - Recursion*
  - Function pointers*
  - IEEE 754 FP compliance*
* Less problematic with new hardware (NVIDIA Fermi)
[slide from Andreas Klöckner]
Exercises 6
Theano Graph
- Theano works with symbolic graphs
- Those graphs are bipartite (graphs with two types of nodes)
- The two node types are Apply nodes and Variable nodes
- Inputs and outputs are lists of Theano variables
[Figure: an Apply node links its Op to its input and output Variables.]
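A short sketch of inspecting this structure on a real graph:
import theano.tensor as T
x = T.dvector('x')
y = x * 2
apply_node = y.owner                 # the Apply node that produced y
print(apply_node.op)                 # the Op, e.g. Elemwise{mul,no_inplace}
print(apply_node.inputs)             # list of input Variables
print(apply_node.outputs[0] is y)    # True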
Op Contract
class MyOp(Op):
    def __eq__(self, other):
    def __hash__(self):
    def __str__(self):
    def make_node(self, *inputs):
    # Python implementation:
    def perform(self, node, inputs_storage, outputs_storage):
    # C implementation: [see the Theano web site]
    # other implementations (PyCUDA, ...):
    def make_thunk(self, node, storage_map, _, _2):
    # optional:
    def __init__(self, ...):
    def grad(self, inputs, g):
    def infer_shape(self, node, input_shapes):
Op Example
import theano

class DoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)
    def __hash__(self):
        return hash(type(self))
    def __str__(self):
        return self.__class__.__name__
    def make_node(self, x):
        x = theano.tensor.as_tensor_variable(x)
        return theano.Apply(self, [x], [x.type()])
    def perform(self, node, inputs, output_storage):
        x = inputs[0]
        z = output_storage[0]
        z[0] = x * 2
x = theano.tensor.matrix()
f = theano.function([x], DoubleOp()(x))
import numpy
inp = numpy.random.rand(5, 5)
out = f(inp)
assert numpy.allclose(inp * 2, out)
print(inp)
print(out)
Exercises 7
Theano+PyCUDA Op Example
class PyCUDADoubleOp(theano.Op):
    def __eq__(self, other):
        return type(self) == type(other)
    def __hash__(self):
        return hash(type(self))
    def __str__(self):
        return self.__class__.__name__
    def make_node(self, inp):
        inp = cuda.basic_ops.gpu_contiguous(
            cuda.basic_ops.as_cuda_ndarray_variable(inp))
        assert inp.dtype == "float32"
        return theano.Apply(self, [inp], [inp.type()])
    # Reconstructed context: this body belongs to make_thunk; the kernel
    # source THE_C_CODE is given on the next slide.
    def make_thunk(self, node, storage_map, _, _2):
        mod = SourceModule(THE_C_CODE)
        pycuda_fct = mod.get_function("my_fct")
        inputs = [storage_map[v] for v in node.inputs]
        outputs = [storage_map[v] for v in node.outputs]
        def thunk():
            z = outputs[0]
            if z[0] is None or z[0].shape != inputs[0][0].shape:
                z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
            grid = (int(numpy.ceil(inputs[0][0].size / 512.)), 1)
            pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                       block=(512, 1, 1), grid=grid)
        return thunk
THE_C_CODE = """
__global__ void my_fct(float * i0, float * o0, int size) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < size) {
        o0[i] = i0[i] * 2;
    }
}"""
x = theano.tensor.fmatrix()
f = theano.function([x], PyCUDADoubleOp()(x))
xv = numpy.ones((4, 5), dtype="float32")
assert numpy.allclose(f(xv), xv * 2)   # the Op doubles its input
Exercises 8
- Currently there are at least 4 different GPU array data structures in use by Python packages:
  - CudaNdarray (Theano), GPUArray (PyCUDA), CUDAMatrix (cudamat), GPUArray (PyOpenCL), ...
  - There are even more if we include other languages
- All of them implement a subset of numpy.ndarray's functionality on the GPU
- Lots of duplicated effort:
  - GPU code is harder and slower to get correct and fast than CPU/Python code
  - The lack of a common array API makes it harder to port and reuse code
  - It also makes code harder to find and distribute
  - It divides development work
Design Goals
- Under development
- Will be the next GPU array container for Theano (this summer!)
- Probably also for PyCUDA and PyOpenCL
- Mailing list: http://lists.tiker.net/listinfo/gpundarray
Conclusion
Thanks
- Thanks to the agencies that provided resources for this project: Calcul Québec, CIFAR, Compute Canada, FQRNT, MITACS, NSERC, SciNet, SHARCNET, Ubisoft and WestGrid.
Questions/Comments?