
Efficient use of Python

Ariel Lozano

CÉCI training

November 22, 2017


Outline

I Analyze our code with profiling tools:


I cpu: cProfile, line_profiler, kernprof
I memory: memory_profiler, mprof
I Python being a highly abstract, dynamically typed language,
how can we make more efficient use of the hardware?
I Numpy and Scipy ecosystem (mainly wrappers to C/Fortran
compiled code)
I binding to compiled code: interfaces between python and
compiled modules
I compiling: tools to compile python code
I parallelism: modules to exploit multicores
Sieve of Eratosthenes
Algorithm to find all prime numbers up to any given limit.
Ex: find all the prime numbers less than or equal to 25:
I 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
I Cross out every second number after 2, up to the limit
(4, 6, 8, ..., 24 are crossed out).
I Move to the next non-crossed number n and cross out every
non-crossed number displaced by n (n = 3 crosses out 9, 15, 21;
n = 5 crosses out 25).
I The numbers remaining non-crossed in the list are all the
primes below the limit: 2 3 5 7 11 13 17 19 23.
Trivial optimization: jump directly to n² to start crossing out.
Then, n must loop only up to √limit.
Simple python implementation

def primes_upto(limit):
    sieve = [False] * 2 + [True] * (limit - 1)
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n**2
            while i < limit + 1:
                sieve[i] = False
                i += n
    return [i for i, prime in enumerate(sieve) if prime]

if __name__ == "__main__":
    primes = primes_upto(25)
    print(primes)

$ python sieve01.py

[2, 3, 5, 7, 11, 13, 17, 19, 23]
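For reference, a Python 3 port of the same sieve only swaps xrange for range (the slides target Python 2 throughout):

```python
def primes_upto(limit):
    # Boolean flags: index i stays True while i is still a prime candidate
    sieve = [False] * 2 + [True] * (limit - 1)
    for n in range(2, int(limit**0.5 + 1)):
        if sieve[n]:
            # Start crossing out at n*n; smaller multiples were already hit
            i = n**2
            while i < limit + 1:
                sieve[i] = False
                i += n
    return [i for i, prime in enumerate(sieve) if prime]

print(primes_upto(25))  # → [2, 3, 5, 7, 11, 13, 17, 19, 23]
```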


Measuring running time

Computing primes up to 30 000 000:


I linux time command
$ time python sieve01.py

real 0m10.419s
user 0m10.192s
sys 0m0.217s
I using timeit module to average several runs
$ python -m timeit -n 3 -r 3 -s "import sieve01" \
> "sieve01.primes_upto(30000000)"

3 loops, best of 3: 10.2 sec per loop
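The same measurement can also be scripted with the timeit module. A minimal sketch (a tiny stand-in statement keeps it self-contained; in practice you would pass the sieve call and setup strings shown above):

```python
import timeit

# repeat=3, number=3 mirrors `-r 3 -n 3`: three samples, each running
# the statement three times; report the best sample.
# In practice: timeit.repeat("sieve01.primes_upto(30000000)",
#                            setup="import sieve01", repeat=3, number=3)
timings = timeit.repeat("sum(range(1000))", repeat=3, number=3)
print("3 loops, best of 3: %.3g sec per loop" % min(timings))
```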


CPU profiling: timing functions

cProfile: built-in profiling tool in the standard library. It hooks


into the virtual machine to measure the time taken to run every
function that it sees.
$ python -m cProfile -s cumulative sieve01.py
5 function calls in 10.859 seconds

Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)


1 0.000 0.000 10.859 10.859 {built-in method builtins.exec}
1 0.087 0.087 10.859 10.859 sieve01.py:3(<module>)
1 9.447 9.447 10.772 10.772 sieve01.py:3(primes_upto)
1 1.325 1.325 1.325 1.325 sieve01.py:11(<listcomp>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler

I Useful information, but for big codebases we will need extra
tools to visualize the dumps
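For quick inspection without an external viewer, the profile can also be collected and read programmatically with the standard pstats module. A sketch on a toy function (the function name is made up for the demo):

```python
import cProfile
import io
import pstats

def toy(n):
    # Deliberately CPU-bound loop so it shows up in the profile
    return sum(i * i for i in range(n))

pr = cProfile.Profile()
pr.enable()
toy(100000)
pr.disable()

# Sort by cumulative time, like `-s cumulative`, and show the top entries
buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf)
stats.sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```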
CPU profiling: line by line details of a function
line_profiler: profiles individual functions on a line-by-line
basis; it introduces a big overhead. We must add the @profile
decorator to the function to be analyzed.
@profile
def primes_upto(limit):
    sieve = [False] * 2 + [True] * (limit - 1)
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n**2
            while i < limit + 1:
                sieve[i] = False
                i += n
    return [i for i, prime in enumerate(sieve) if prime]

if __name__ == "__main__":
    primes = primes_upto(30000000)

Then, we run the code with the kernprof.py script provided by


the package.
CPU profiling: line by line details of a function
$ kernprof -l -v sieve01_prof.py
Wrote profile results to sieve01_prof.py.lprof
Timer unit: 1e-06 s

Total time: 101.025 s


File: sieve01_prof.py
Function: primes_upto at line 2

Line # Hits Time Per Hit % Time Line Contents


==============================================================
2 @profile
3 def primes_upto(limit):
4 1 415906 415906.0 0.4 sieve = [False] * 2 + [True] *
5 5477 2307 0.4 0.0 for n in xrange(2, int(limit**0
6 5476 2362 0.4 0.0 if sieve[n]:
7 723 680 0.9 0.0 i = n**2
8 70634832 28740579 0.4 28.4 while i < limit+1:
9 70634109 33142484 0.5 32.8 sieve[i] = False
10 70634109 26776815 0.4 26.5 i += n
11 30000002 11943768 0.4 11.8 return [i for i, prime in enume

% Time is relative to each function only, not to total running time.


Memory profiling: line by line details of a function

memory_profiler: module to measure memory usage on a
line-by-line basis; runs will be even slower than with
line_profiler. The @profile decorator is also required on the
function to analyze.
$ python -m memory_profiler sieve01_prof.py
Filename: sieve01_prof.py

Line # Mem usage Increment Line Contents


================================================
2 32.715 MiB 0.000 MiB @profile
3 def primes_upto(limit):
4 261.703 MiB 228.988 MiB sieve = [False] * 2 + [True] * (limit - 1)
5 261.703 MiB 0.000 MiB for n in xrange(2, int(limit**0.5 + 1)):
6 261.703 MiB 0.000 MiB if sieve[n]:
7 261.703 MiB 0.000 MiB i = n**2
8 261.703 MiB 0.000 MiB while i < limit+1:
9 261.703 MiB 0.000 MiB sieve[i] = False
10 261.703 MiB 0.000 MiB i += n
11 return [i for i, prime in enumerate(sieve) if
Memory profiling: line by line details of a function

Why are 228 MB allocated on this line?


4 261.703 MiB 228.988 MiB sieve = [False] * 2 + [True] * (limit - 1)

I In a Python list each boolean element has a size of 8 bytes:
the size of a pointer (a C long) on 64-bit platforms.
I We are creating a list with 30000002 elements.
I Doing the math: 30000002 * 8 / (1024 * 1024) = 228.881 MiB
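The per-element cost can be checked directly with sys.getsizeof; a small sketch (the 1000-element list is just a probe, the ~8-byte figure assumes a 64-bit CPython):

```python
import sys

# Each list slot holds an 8-byte pointer on a 64-bit build;
# the True/False objects themselves are shared singletons.
small = [True] * 1000
per_element = (sys.getsizeof(small) - sys.getsizeof([])) / 1000
print(per_element)  # ~8.0 on 64-bit CPython

# Scaling up to the sieve list from the slide:
print(30000002 * 8 / (1024 * 1024))  # ≈ 228.88 MiB
```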
Memory profiling: analyzing the whole run vs time

I Line-by-line analysis introduces a huge overhead; runs can be
up to 100x slower
I We can miss information due to many
allocations/deallocations taking place on a single line
I The memory_profiler package provides the mprof tool to
analyze and visualize the memory usage as a function of
time
I It has a very minor impact on the running time
I Usage:
$ mprof run --python python2 mycode.py
$ mprof plot
Memory profiling: analyzing the whole run vs time
$ mprof run --python python2 sieve01.py
$ mprof plot
Memory profiling: analyzing the whole run vs time

We can add the @profile decorator and timestamps to bring more
detail into the analysis.
@profile
def primes_upto(limit):
    with profile.timestamp("create_sieve_list"):
        sieve = [False] * 2 + [True] * (limit - 1)
    with profile.timestamp("cross_out_sieve"):
        for n in xrange(2, int(limit**0.5 + 1)):
            if sieve[n]:
                i = n**2
                while i < limit + 1:
                    sieve[i] = False
                    i += n
    return [i for i, prime in enumerate(sieve) if prime]
Memory profiling: analyzing the whole run vs time
$ mprof run --python python2 sieve01_memprof.py
$ mprof plot
Memory profiling: analyzing the whole run vs time
Why the 500 MB peak during the sieve list creation?
I Experimenting with the mprof tool, it can be verified that:

sieve = [False] * 2 + [True] * (limit - 1)

I is actually equivalent to something like:

sieve1 = [False] * 2
sieve2 = [True] * (limit - 1)
sieve = sieve1 + sieve2
del sieve1
del sieve2

I An extra list of ≈ 30e6 booleans is temporarily allocated!
I We can try to replace it with:

sieve = [True] * (limit + 1)
sieve[0] = False
sieve[1] = False
Memory profiling: analyzing the whole run vs time
$ mprof run --python python2 sieve02_memprof.py
$ mprof plot
Numpy library

I Provides a new kind of array datatype


I Contains methods for fast operations on entire arrays
without having to write loops
I They are basically wrappers to compiled C/Fortran/C++
code
I Runs almost as quickly as C
I It is the foundation of many other higher-level numerical
tools
I Compares to MATLAB in functionality
Numpy library: matrix vector product

>>> import numpy as np
>>> a = np.array([[5, 1, 3],
...               [1, 1, 1],
...               [1, 2, 1]])
>>> b = np.array([1, 2, 3])
>>> c = a.dot(b)
>>> c
array([16,  6,  8])
Numpy library: sieve revisited

We replace the sieve list with a Numpy boolean array:


import numpy as np

def primes_upto(limit):
    sieve = np.ones(limit + 1, dtype=np.bool)
    sieve[0] = False
    sieve[1] = False
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            i = n**2
            while i < limit + 1:
                sieve[i] = False
                i += n
    return [i for i, prime in enumerate(sieve) if prime]
Numpy library: sieve revisited

I In a Numpy array each boolean has a size of 1 byte
I Math now: 30000002 * 1 / (1024 * 1024) = 28.61 MiB
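The one-byte-per-element claim can be verified with the array's nbytes and itemsize attributes (a small sketch; assumes NumPy is installed, and uses the plain bool dtype, which NumPy maps to np.bool_):

```python
import numpy as np

sieve = np.ones(1000, dtype=bool)
print(sieve.nbytes)    # 1000: one byte per boolean
print(sieve.itemsize)  # 1
```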
Numpy library: sieve revisited
I Timing did not improve with the Numpy array and the same loop
I Fully Numpy solution, using slice indexing to iterate:

import numpy as np

def primes_upto(limit):
    sieve = np.ones(limit + 1, dtype=np.bool)
    sieve[0] = False
    sieve[1] = False
    for n in xrange(2, int(limit**0.5 + 1)):
        if sieve[n]:
            sieve[n*n::n] = 0
    return np.nonzero(sieve)[0]

$ time python2 sieve04_np.py


real 0m0.552s
user 0m0.518s
sys 0m0.033s

I 22x gain in time!!


Numpy library: sieve line by line profiling
$ kernprof -l -v sieve04_np_prof.py
Wrote profile results to sieve04_np_prof.py.lprof
Timer unit: 1e-06 s

Total time: 0.482723 s


File: sieve04_np_prof.py
Function: primes_upto at line 3

Line # Hits Time Per Hit % Time Line Contents


==============================================================
3 @profile
4 def primes_upto(limit):
5 1 8785 8785.0 1.8 sieve = np.ones(limit
6 1 5 5.0 0.0 sieve[0] = False
7 1 0 0.0 0.0 sieve[1] = False
8 5477 2796 0.5 0.6 for n in xrange(2, int
9 5476 3119 0.6 0.6 if sieve[n]:
10 723 420784 582.0 87.2 sieve[n**2::n]
11 1 47234 47234.0 9.8 return np.nonzero(siev
Numpy library: sieve line by line profiling

I line_profiler helps to understand the massive gain


I Pure python solution:
6 5476 2362 0.4 0.0 if sieve[n]:
7 723 680 0.9 0.0 i = n**2
8 70634832 28740579 0.4 28.4 while i < limit+1:
9 70634109 33142484 0.5 32.8 sieve[i] = False
10 70634109 26776815 0.4 26.5 i += n
I Full Numpy solution:
9 5476 3119 0.6 0.6 if sieve[n]:
10 723 420784 582.0 87.2 sieve[n**2::n] = 0
I The loops that cross out the sieve are performed entirely by
lower-level compiled code inside Numpy
I Time and memory usage are comparable to compiled C or
Fortran solutions!
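The equivalence between the explicit while loop and the single slice assignment can be checked on a small array (a sketch assuming NumPy; limit and n are arbitrary demo values):

```python
import numpy as np

limit = 100
n = 7

# Explicit Python loop, as in the pure-python version
a = np.ones(limit + 1, dtype=bool)
i = n**2
while i < limit + 1:
    a[i] = False
    i += n

# Single vectorized slice assignment, as in the NumPy version
b = np.ones(limit + 1, dtype=bool)
b[n**2::n] = False

print(np.array_equal(a, b))  # → True
```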
CPU and Memory profiling: summary

I Line-by-line profilers introduce a huge overhead; use them
with a reduced problem size and only on the specific functions
detected as bottlenecks
I The mprof tool is very flexible; with smart timestamping it
can be used both as a fast CPU and memory profiler
I The cProfile dumps are great to detect bottlenecks on big
projects, but a visualization tool is almost mandatory.
Explore the KCachegrind package; usual workflow:
$ python -m cProfile -o prof.out sieve02.py
$ pyprof2calltree -i prof.out -k
Numpy library: SciPy ecosystem
Collection of open source software for scientific computing in
Python
I Core packages:
I NumPy: the fundamental package for numerical computation
I SciPy library: collection of numerical algorithms and
domain-specific toolboxes, including signal processing, Fourier
transforms, clustering, optimization, statistics...
I Matplotlib: a mature plotting package, provides publication-quality
2D plotting as well as rudimentary 3D plotting
I Data and computation:
I pandas: providing high-performance, easy to use data structures
(similar to R)
I SymPy: symbolic mathematics and computer algebra
I scikit-image: algorithms for image processing
I scikit-learn: algorithms and tools for machine learning
I h5py and PyTables: can both access data stored in the HDF5
format
Python Bindings

I Interfacing python with compiled code can provide huge


performance gains
I f2py: project to provide a connection between Python and
Fortran languages
I weave: tools for including C/C++ code within Python code
I cffi (C Foreign Function Interface for Python): Interact with
almost any C code from Python.
I ctypes: foreign function library for Python. It provides C
compatible data types and allows calling functions in DLLs
or shared libraries.
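A minimal ctypes sketch, calling strlen from the C standard library (assumes a standard libc can be located on the system; on most POSIX systems find_library("c") resolves it):

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello from C"))  # → 12
```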
Python Bindings: f2py example

hello.f90:

subroutine foo(a)
  integer a
  print *, "Hello from Fortran!"
  print *, "a = ", a
end

call_fhello.py:

import hello

if __name__ == "__main__":
    hello.foo(10)

$ f2py2 -c -m hello hello.f90
$ python2 call_fhello.py
Hello from Fortran!
a = 10
Compiled Python
There are also tools to compile python code
I cython: C-Extensions for Python
I optimising and static compiler
I can compile Python code and Cython language
I can compile Python with Numpy code
I can do bindings with C code
I Pypy: Just-in-time compiler
I sometimes less memory hungry than Cython
I not fully compatible with Python code that uses Numpy
I Numba: a compiler specialized for numpy code using the
LLVM compiler
I Pythran: compiler for both numpy and non-numpy code.
Takes advantage of multi-cores and single instruction
multiple data (SIMD) units
I All of them, except PyPy, require modifying or decorating the
original Python code
Compiled Python: pypy

I We can directly run the original sieve01.py with pypy


$ time pypy sieve01.py

real 0m2.593s
user 0m2.222s
sys 0m0.294s
Parallel processing

I multiprocessing module
I allows process- and thread-based parallel processing
I allows sharing memory among processes
I constrained to single-machine multicore parallelism
I mpi4py
I Python bindings to the MPI-1/2/3 interfaces
I if you know MPI in C/Fortran you already know mpi4py
I can equally use multiple cores on a single machine or
distributed across several machines
I each process has a separate address space; no possibility to
share memory between them
I we covered it in the MPI session
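A minimal multiprocessing sketch, farming a CPU-bound primality check out to a pool of workers (not from the slides; trial division is used only to give each worker something to compute):

```python
from multiprocessing import Pool

def is_prime(n):
    # Simple trial division up to sqrt(n); enough for a demo
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

if __name__ == "__main__":
    # pool.map splits the input range across 4 worker processes
    with Pool(processes=4) as pool:
        flags = pool.map(is_prime, range(26))
    primes = [n for n, f in zip(range(26), flags) if f]
    print(primes)  # → [2, 3, 5, 7, 11, 13, 17, 19, 23]
```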
Further information on the topic

I High Performance Python, by Micha Gorelick and Ian Ozsvald
I Python in HPC Tutorial:
https://github.com/pyHPC/pyhpc-tutorial
