High Performance Computing with Python (4 hour tutorial) EuroPython 2011
Goal Get you writing faster code for CPU-bound problems using Python Your task is probably in pure Python, is CPU bound and can be parallelised (right?) We're not looking at network-bound problems Profiling + Tools == Speed
Get the source please! http://tinyurl.com/europyhpc (original: http://ianozsvald.com/wp-content/hpc_tutorial_code_europython2011_v1.zip ) google: “github ianozsvald”, get HPC full source (but you can do this after!)
About me (Ian Ozsvald) A.I. researcher in industry for 12 years C, C++, (some) Java, Python for 8 years Demo'd pyCUDA and Headroid last year Lecturer on A.I. at Sussex Uni (a bit) ShowMeDo.com co-founder Python teacher, BrightonPy co-founder IanOzsvald.com - MorConsulting.com
Overview (pre-requisites) cProfile, line_profiler, runsnake numpy Cython and ShedSkin multiprocessing ParallelPython PyPy pyCUDA
We won't be looking at... Algorithmic choices, clusters or cloud Gnumpy (numpy->GPU) Theano (numpy(ish)->CPU/GPU) CopperHead (numpy(ish)->GPU) BottleNeck (Cython'd numpy) Map/Reduce pyOpenCL
Something to consider: “Proebsting's Law” http://research.microsoft.com/en-us/um/people/toddpro/papers/law.htm Compiler advances (generally) unhelpful (sort-of – consider auto vectorisation!) Multi-core common Very-parallel (CUDA, OpenCL, MS AMP, APUs) should be considered
What can we expect? Close to C speeds (shootout): http://attractivechaos.github.com/plb/ http://shootout.alioth.debian.org/u32/which-programming-languages-are-fastest.php Depends on how much work you put in nbody JavaScript much faster than Python but we can catch it/beat it (and get close to C speed)
Practical result - PANalytical
Mandelbrot results (Desktop i3)
Our code pure_python.py  numpy_vector.py  pure_python.py 1000 1000 # RUN Our two building blocks Google “github ianozsvald” -> EuroPython2011_HighPerformanceComputing https://github.com/ianozsvald/EuroPython2011_HighPerformanceComputing
Profiling bottlenecks
    python -m cProfile -o rep.prof pure_python.py 1000 1000
    import pstats
    p = pstats.Stats('rep.prof')
    p.sort_stats('cumulative').print_stats(10)
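The same report can also be produced from inside a script rather than from the command line. What follows is a minimal sketch, assuming a small stand-in workload: `calculate_z` and `profile_call` are hypothetical names, not the tutorial's code. Only `cProfile.Profile`, `pstats.Stats`, `sort_stats` and `print_stats` come from the standard library.

```python
import cProfile
import io
import pstats


def calculate_z(q, maxiter):
    """Stand-in CPU-bound workload (hypothetical, not the tutorial's file)."""
    output = [0] * len(q)
    for i, qi in enumerate(q):
        zi = 0 + 0j
        for iteration in range(maxiter):
            zi = zi * zi + qi
            if abs(zi) > 2.0:
                output[i] = iteration
                break
    return output


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Same sort order and row limit as the pstats session on the slide.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


# A small grid of complex points, chosen only for illustration.
points = [complex(x / 20.0, y / 20.0)
          for x in range(-40, 20) for y in range(-20, 20)]
result, report = profile_call(calculate_z, points, 50)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is usually the first thing to look at.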
cProfile output
51923594 function calls (51923523 primitive calls) in 74.301 seconds
      ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
           1    0.034    0.034   74.303   74.303  pure_python.py:1(<module>)
           1    0.273    0.273   74.268   74.268  pure_python.py:23(calc_pure_python)
           1   57.168   57.168   73.580   73.580  pure_python.py:9(calculate_z_serial_purepython)
  51,414,419   12.465    0.000   12.465    0.000  {abs}
  ...
RunSnakeRun
Let's profile pure_python.py python -m cProfile -o res.prof pure_python.py 1000 1000 runsnake res.prof Let's look at the result
What's the problem? What's really slow? Useful from a high level... We want a line profiler!
line_profiler.py kernprof.py -l -v pure_python_lineprofiler.py 1000 1000 Warning...slow! We might want to use  300 100
kernprof.py output
...% Time  Line Contents
=====================
           @profile
           def calculate_z_serial_purepython(q, maxiter, z):
    0.0        output = [0] * len(q)
    1.1        for i in range(len(q)):
   27.8            for iteration in range(maxiter):
   35.8                z[i] = z[i]*z[i] + q[i]
   31.9                if abs(z[i]) > 2.0:
Dereferencing is slow Dereferencing involves lookups – slow Our ' i ' changes slowly zi = z[i]; qi = q[i] # DO IT Change all  z[i]  and  q[i]  references Run  kernprof  again Is it cheaper?
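The change can be sketched as a before/after pair. This is an illustration under assumptions, not the contents of `pure_python_2.py`: the function names `calc_indexed` and `calc_local` are hypothetical, and the loop tail (`output[i] = iteration; break`) is assumed from the profiled lines shown earlier.

```python
def calc_indexed(q, maxiter, z):
    """Inner loop indexes z[i] and q[i] on every iteration."""
    output = [0] * len(q)
    for i in range(len(q)):
        for iteration in range(maxiter):
            z[i] = z[i] * z[i] + q[i]
            if abs(z[i]) > 2.0:
                output[i] = iteration
                break
    return output


def calc_local(q, maxiter, z):
    """Same calculation, but z[i] and q[i] are read once per point."""
    output = [0] * len(q)
    for i in range(len(q)):
        zi = z[i]  # i changes slowly, so look these up outside the inner loop
        qi = q[i]
        for iteration in range(maxiter):
            zi = zi * zi + qi
            if abs(zi) > 2.0:
                output[i] = iteration
                break
    return output


q = [0 + 0j, 2 + 2j, -1 + 0j, 0.5 + 0.5j]
z0 = [0 + 0j] * len(q)
# Both versions should agree; pass copies because calc_indexed mutates z.
print(calc_indexed(q, 50, list(z0)) == calc_local(q, 50, list(z0)))
```

Note one behavioural difference: the indexed version updates `z` in place, while the local-variable version leaves `z` untouched. That only matters if the caller reads `z` afterwards.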
We have faster code pure_python_2.py is faster, we'll use this as the basis for the next steps There are tricks: sets over lists if possible; use dict[] rather than dict.get(); built-in sort is fast; list comprehensions; map rather than loops
PyPy 1.5 Confession – I'm a newbie Probably cool tricks to learn pypy pure_python_2.py 1000 1000 PIL is supported, numpy isn't My (bad) code needs numpy for display (maybe you can fix that?) pypy -m cProfile -o runpypy.prof pure_python_2.py 1000 1000 # abs but no range
Cython Manually add types, converts to C .pyx files (built on Pyrex) Win/Mac/Lin with gcc, msvc etc 10-100x speed-up numpy integration http://cython.org/
Cython on pure_python_2.py # ./cython_pure_python Make  calculate_z.py , test it works Turn  calculate_z.py  to  .pyx Add  setup.py  (see Getting Started doc) python setup.py build_ext --inplace cython -a calculate_z.pyx  to get profiling feedback (.html)
Cython types Help Cython by adding annotations: list q z; int, unsigned int # hint no negative indices with for loop; complex and double complex How much faster?
Compiler directives http://wiki.cython.org/enhancements/compilerdirectives We can go faster (maybe): #cython: boundscheck=False #cython: wraparound=False Profiling: #cython: profile=True Check profiling works Show  _2_bettermath # FAST!
ShedSkin http://code.google.com/p/shedskin/ Auto-converts Python to C++ (auto type inference) Can only import modules that have been implemented No numpy, PIL etc but great for writing new fast modules 3000 SLOC 'limit', always improving
Easy to use # ./shedskin/
    shedskin shedskin1.py
    make
    ./shedskin1 1000 1000
    shedskin shedskin2.py; make
    ./shedskin2 1000 1000 # FAST!
No easy profiling, complex is slow (for now)
numpy vectors http://numpy.scipy.org/ Vectors not brilliantly suited to Mandelbrot (but we'll ignore that...) numpy is very-parallel for CPUs a = numpy.array([1,2,3,4]) a *= 3 -> numpy.array([3,6,9,12])
Vector outline... # ./numpy_vector/numpy_vector.py
    for iteration...
        z = z*z + q
        done = np.greater(abs(z), 2.0)
        q = np.where(done, 0+0j, q)
        z = np.where(done, 0+0j, z)
        output = np.where(done, iteration, output)
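The outline can be filled in as a runnable function. This is a sketch under assumptions rather than the contents of `numpy_vector.py`: the function name `calculate_z_numpy`, the zero starting value for `z` and the `int64` output type are all hypothetical choices, and it needs numpy installed.

```python
import numpy as np


def calculate_z_numpy(q, maxiter):
    """Vectorised escape-time loop: every point is updated on each pass."""
    q = np.array(q, dtype=np.complex128)  # copy, so the caller's data is untouched
    z = np.zeros_like(q)
    output = np.zeros(q.shape, dtype=np.int64)
    for iteration in range(maxiter):
        z = z * z + q
        done = np.abs(z) > 2.0
        # Record the iteration at which each point first escaped.
        output = np.where(done, iteration, output)
        # Zero escaped points so they stay at 0 and never count as done again.
        q = np.where(done, 0 + 0j, q)
        z = np.where(done, 0 + 0j, z)
    return output


print(calculate_z_numpy([0 + 0j, 2 + 2j, 0.5 + 0.5j], 50))
```

Zeroing `q` and `z` for escaped points is what stands in for the `break` of the pure Python loop: the arithmetic still runs for those points on every pass, which is one reason vectors are not an ideal fit for this problem.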
Profiling some more python numpy_vector.py 1000 1000 kernprof.py -l -v numpy_vector.py 300 100 How could we break out early? How big is 250,000 complex numbers? # .nbytes, .size
Cache sizes Modern CPUs have 2-6MB caches Tuning is hard (and may not be worthwhile) Heuristic: Either keep it tiny (<64KB) or worry about really big data sets (>20MB) # numpy_vector_2.py
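The size question can be answered directly with `.size`, `.itemsize` and `.nbytes`. A small sketch, assuming the points are stored as `complex128` (two 64-bit floats per element); the variable names are illustrative only.

```python
import numpy as np

n_points = 250000
z = np.zeros(n_points, dtype=np.complex128)

print(z.size)                      # 250000 elements
print(z.itemsize)                  # 16 bytes per complex128 element
print(z.nbytes)                    # 4000000 bytes in total
print(round(z.nbytes / 2**20, 2))  # 3.81 (MiB)
```

At roughly 4MB, one such array already sits at the upper end of the 2-6MB cache range quoted above, and the vector loop works on several arrays of this length at once, so the working set is unlikely to fit in cache.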
Speed vs cache size (Core2/i3)
NumExpr http://code.google.com/p/numexpr/ This is magic With Intel MKL it goes even faster # ./numpy_vector_numexpr/ python numpy_vector_numexpr.py 1000 1000 Now convert your  numpy_vector.py
numpy and iteration Normally there's no point using numpy if we aren't using vector operations python numpy_loop.py 1000 1000 Is it any faster? Let's run  kernprof.py  on this and the earlier  pure_python_2.py Any significant differences?
Cython on numpy_loop.py Can low-level C give us a speed-up over vectorised C? # ./cython_numpy_loop/ http://docs.cython.org/src/tutorial/numpy.html Your task – make .pyx, start without types, make it work from  numpy_loop.py Add basic types, use  cython -a
multiprocessing Using all our CPUs is cool, 4 are common, 8 will be common Global Interpreter Lock (isn't our enemy) Silo'd processes are easiest to parallelise http://docs.python.org/library/multiprocessing.html
multiprocessing Pool # ./multiprocessing/multi.py
    p = multiprocessing.Pool()
    po = p.map_async(fn, args)
    result = po.get() # for all po objects
join the result items to make full result
Making chunks of work Split the work into chunks (follow my code) Splitting by number of CPUs is good Submit the jobs with map_async Get the results back, join the lists
Code outline Copy my chunk code
    output = []
    for chunk in chunks:
        out = calc...(chunk)
        output += out
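The chunking steps above can be sketched end to end. This is a hedged illustration, not the tutorial's `multi.py`: `calc_chunk`, `make_chunks` and `calc_parallel` are hypothetical names, and each chunk is assumed to be a `(points, maxiter)` tuple so that it can be passed as a single argument to `map_async`.

```python
import multiprocessing


def calc_chunk(chunk):
    """Worker: compute escape iterations for one (points, maxiter) chunk."""
    q, maxiter = chunk
    output = [0] * len(q)
    for i, qi in enumerate(q):
        zi = 0 + 0j
        for iteration in range(maxiter):
            zi = zi * zi + qi
            if abs(zi) > 2.0:
                output[i] = iteration
                break
    return output


def make_chunks(q, maxiter, nbr_chunks):
    """Split q into at most nbr_chunks contiguous, roughly equal pieces."""
    chunk_size = max(1, -(-len(q) // nbr_chunks))  # ceiling division
    return [(q[i:i + chunk_size], maxiter)
            for i in range(0, len(q), chunk_size)]


def calc_parallel(q, maxiter, nbr_chunks=None):
    """Submit the chunks with map_async and join the result lists in order."""
    nbr_chunks = nbr_chunks or multiprocessing.cpu_count()
    chunks = make_chunks(q, maxiter, nbr_chunks)
    with multiprocessing.Pool() as pool:
        po = pool.map_async(calc_chunk, chunks)
        results = po.get()
    output = []
    for out in results:
        output += out
    return output
```

Call `calc_parallel(points, maxiter)` from a script's `if __name__ == "__main__":` block, since multiprocessing re-imports the main module on platforms that spawn rather than fork. Because `map_async` returns results in submission order, joining the lists reproduces the serial output.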
ParallelPython Same principle as multiprocessing but allows >1 machine with >1 CPU http://www.parallelpython.com/ Seems to work poorly with lots of data (e.g. 8MB split into 4 lists...!) We can run it locally, run it locally via ppserver.py and run it remotely too Can we demo it to another machine?
ParallelPython + binaries We can ask it to use modules, other functions and our own compiled modules Works for Cython and ShedSkin Modules have to be in PYTHONPATH (or current directory for ppserver.py) parallelpython_cython_pure_python
Challenge... Can we send binaries (.so/.pyd) automatically? It looks like we could We'd then avoid having to deploy to remote machines ahead of time... Anybody want to help me?
pyCUDA NVIDIA's CUDA -> Python wrapper http://mathema.tician.de/software/pycuda Can be a pain to install... Has numpy-like interface and two lower level C interfaces
pyCUDA demos # ./pyCUDA/ I'm using float32/complex64 as my CUDA card is too old :-( (Compute 1.3) numpy-like interface is easy but slow elementwise requires C thinking sourcemodule gives you complete control Great for prototyping and moving to C
Birds of Feather? numpy is cool but CPU bound pyCUDA is cool and is numpy-like Could we monkey patch numpy to auto-run CUDA(/openCL) if a card is present? Anyone want to chat about this?
Future trends multi-core is obvious CUDA-like systems are inevitable write-once, deploy to many targets – that would be lovely Cython+ShedSkin could be cool Parallel Cython could be cool Refactoring with rope is definitely cool
Bits to consider Cython being wired into Python (GSoC) CorePy assembly -> numpy  http://numcorepy.blogspot.com/ PyPy advancing nicely GPUs being interwoven with CPUs (APU) numpy+NumExpr->GPU/CPU mix? Learning how to massively parallelise is the key
Feedback I plan to write this up I want feedback (and maybe a testimonial if you found this helpful?) [email_address] Thank you :-)

