-
Notifications
You must be signed in to change notification settings - Fork 81
/
Copy pathperformance.txt
473 lines (345 loc) · 18.9 KB
/
performance.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
.. _performance-howto:
=========================
How to optimize for speed
=========================
The following gives some practical guidelines to help you write efficient
code for the scikit-learn project.
.. note::
While it is always useful to profile your code so as to **check
performance assumptions**, it is also highly recommended
to **review the literature** to ensure that the implemented algorithm
is the state of the art for the task before investing into costly
implementation optimization.
Times and times, hours of efforts invested in optimizing complicated
implementation details have been rendered irrelevant by the subsequent
discovery of simple **algorithmic tricks**, or by using another algorithm
altogether that is better suited to the problem.
The section :ref:`warm-restarts` gives an example of such a trick.
Python, Cython or C/C++?
========================
.. currentmodule:: sklearn
In general, the scikit-learn project emphasizes the **readability** of
the source code to make it easy for the project users to dive into the
source code so as to understand how the algorithm behaves on their data
but also for ease of maintainability (by the developers).
When implementing a new algorithm is thus recommended to **start
implementing it in Python using Numpy and Scipy** by taking care of avoiding
looping code using the vectorized idioms of those libraries. In practice
this means trying to **replace any nested for loops by calls to equivalent
Numpy array methods**. The goal is to avoid the CPU wasting time in the
Python interpreter rather than crunching numbers to fit your statistical
model. It's generally a good idea to consider NumPy and SciPy performance tips:
http://scipy.github.io/old-wiki/pages/PerformanceTips
Sometimes however an algorithm cannot be expressed efficiently in simple
vectorized Numpy code. In this case, the recommended strategy is the
following:
1. **Profile** the Python implementation to find the main bottleneck and
isolate it in a **dedicated module level function**. This function
will be reimplemented as a compiled extension module.
2. If there exists a well maintained BSD or MIT **C/C++** implementation
of the same algorithm that is not too big, you can write a
**Cython wrapper** for it and include a copy of the source code
of the library in the scikit-learn source tree: this strategy is
used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and
:class:`linear_model.LogisticRegression` (wrappers for liblinear
and libsvm).
3. Otherwise, write an optimized version of your Python function using
**Cython** directly. This strategy is used
for the :class:`linear_model.ElasticNet` and
:class:`linear_model.SGDClassifier` classes for instance.
4. **Move the Python version of the function in the tests** and use
it to check that the results of the compiled extension are consistent
with the gold standard, easy to debug Python version.
5. Once the code is optimized (not simple bottleneck spottable by
profiling), check whether it is possible to have **coarse grained
parallelism** that is amenable to **multi-processing** by using the
``joblib.Parallel`` class.
When using Cython, use either
$ python setup.py build_ext -i
$ python setup.py install
to generate C files. You are responsible for adding .c/.cpp extensions along
with build parameters in each submodule ``setup.py``.
C/C++ generated files are embedded in distributed stable packages. The goal is
to make it possible to install scikit-learn stable version
on any machine with Python, Numpy, Scipy and C/C++ compiler.
Fast matrix multiplications
===========================
Matrix multiplications (matrix-matrix and matrix-vector) are usually handled
using the NumPy function ``np.dot``, but in versions of NumPy before 1.7.2
this function is suboptimal when the inputs are not both in the C (row-major)
layout; in that case, the inputs may be implicitly copied to obtain the right
layout. This obviously consumes memory and takes time.
The function ``fast_dot`` in ``sklearn.utils.extmath`` offers a fast
replacement for ``np.dot`` that prevents copies from being made in some cases.
In all other cases, it dispatches to ``np.dot`` and when the NumPy version is
new enough, it is in fact an alias for that function, making it a drop-in
replacement. Example usage of ``fast_dot``::
>>> import numpy as np
>>> from sklearn.utils.extmath import fast_dot
>>> X = np.random.random_sample([2, 10])
>>> np.allclose(np.dot(X, X.T), fast_dot(X, X.T))
True
This function operates optimally on 2-dimensional arrays, both of the same
dtype, which should be either single or double precision float. If these
requirements aren't met or the BLAS package is not available, the call is
silently dispatched to ``numpy.dot``. If you want to be sure when the original
``numpy.dot`` has been invoked in a situation where it is suboptimal, you can
activate the related warning::
>>> import warnings
>>> from sklearn.exceptions import NonBLASDotWarning
>>> warnings.simplefilter('always', NonBLASDotWarning) # doctest: +SKIP
.. _profiling-python-code:
Profiling Python code
=====================
In order to profile Python code we recommend to write a script that
loads and prepare you data and then use the IPython integrated profiler
for interactively exploring the relevant part for the code.
Suppose we want to profile the Non Negative Matrix Factorization module
of the scikit. Let us setup a new IPython session and load the digits
dataset and as in the :ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py` example::
In [1]: from sklearn.decomposition import NMF
In [2]: from sklearn.datasets import load_digits
In [3]: X = load_digits().data
Before starting the profiling session and engaging in tentative
optimization iterations, it is important to measure the total execution
time of the function we want to optimize without any kind of profiler
overhead and save it somewhere for later reference::
In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
To have a look at the overall performance profile using the ``%prun``
magic command::
In [5]: %prun -l nmf.py NMF(n_components=16, tol=1e-2).fit(X)
14496 function calls in 1.682 CPU seconds
Ordered by: internal time
List reduced from 90 to 9 due to restriction <'nmf.py'>
ncalls tottime percall cumtime percall filename:lineno(function)
36 0.609 0.017 1.499 0.042 nmf.py:151(_nls_subproblem)
1263 0.157 0.000 0.157 0.000 nmf.py:18(_pos)
1 0.053 0.053 1.681 1.681 nmf.py:352(fit_transform)
673 0.008 0.000 0.057 0.000 nmf.py:28(norm)
1 0.006 0.006 0.047 0.047 nmf.py:42(_initialize_nmf)
36 0.001 0.000 0.010 0.000 nmf.py:36(_sparseness)
30 0.001 0.000 0.001 0.000 nmf.py:23(_neg)
1 0.000 0.000 0.000 0.000 nmf.py:337(__init__)
1 0.000 0.000 1.681 1.681 nmf.py:461(fit)
The ``tottime`` column is the most interesting: it gives to total time spent
executing the code of a given function ignoring the time spent in executing the
sub-functions. The real total time (local code + sub-function calls) is given by
the ``cumtime`` column.
Note the use of the ``-l nmf.py`` that restricts the output to lines that
contains the "nmf.py" string. This is useful to have a quick look at the hotspot
of the nmf Python module it-self ignoring anything else.
Here is the beginning of the output of the same command without the ``-l nmf.py``
filter::
In [5] %prun NMF(n_components=16, tol=1e-2).fit(X)
16159 function calls in 1.840 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2833 0.653 0.000 0.653 0.000 {numpy.core._dotblas.dot}
46 0.651 0.014 1.636 0.036 nmf.py:151(_nls_subproblem)
1397 0.171 0.000 0.171 0.000 nmf.py:18(_pos)
2780 0.167 0.000 0.167 0.000 {method 'sum' of 'numpy.ndarray' objects}
1 0.064 0.064 1.840 1.840 nmf.py:352(fit_transform)
1542 0.043 0.000 0.043 0.000 {method 'flatten' of 'numpy.ndarray' objects}
337 0.019 0.000 0.019 0.000 {method 'all' of 'numpy.ndarray' objects}
2734 0.011 0.000 0.181 0.000 fromnumeric.py:1185(sum)
2 0.010 0.005 0.010 0.005 {numpy.linalg.lapack_lite.dgesdd}
748 0.009 0.000 0.065 0.000 nmf.py:28(norm)
...
The above results show that the execution is largely dominated by
dot products operations (delegated to blas). Hence there is probably
no huge gain to expect by rewriting this code in Cython or C/C++: in
this case out of the 1.7s total execution time, almost 0.7s are spent
in compiled code we can consider optimal. By rewriting the rest of the
Python code and assuming we could achieve a 1000% boost on this portion
(which is highly unlikely given the shallowness of the Python loops),
we would not gain more than a 2.4x speed-up globally.
Hence major improvements can only be achieved by **algorithmic
improvements** in this particular example (e.g. trying to find operation
that are both costly and useless to avoid computing then rather than
trying to optimize their implementation).
It is however still interesting to check what's happening inside the
``_nls_subproblem`` function which is the hotspot if we only consider
Python code: it takes around 100% of the accumulated time of the module. In
order to better understand the profile of this specific function, let
us install ``line-prof`` and wire it to IPython::
$ pip install line-profiler
- **Under IPython <= 0.10**, edit ``~/.ipython/ipy_user_conf.py`` and
ensure the following lines are present::
import IPython.ipapi
ip = IPython.ipapi.get()
Towards the end of the file, define the ``%lprun`` magic::
import line_profiler
ip.expose_magic('lprun', line_profiler.magic_lprun)
- **Under IPython 0.11+**, first create a configuration profile::
$ ipython profile create
Then create a file named ``~/.ipython/extensions/line_profiler_ext.py`` with
the following content::
import line_profiler
def load_ipython_extension(ip):
ip.define_magic('lprun', line_profiler.magic_lprun)
Then register it in ``~/.ipython/profile_default/ipython_config.py``::
c.TerminalIPythonApp.extensions = [
'line_profiler_ext',
]
c.InteractiveShellApp.extensions = [
'line_profiler_ext',
]
This will register the ``%lprun`` magic command in the IPython terminal
application and the other frontends such as qtconsole and notebook.
Now restart IPython and let us use this new toy::
In [1]: from sklearn.datasets import load_digits
In [2]: from sklearn.decomposition.nmf import _nls_subproblem, NMF
In [3]: X = load_digits().data
In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
Timer unit: 1e-06 s
File: sklearn/decomposition/nmf.py
Function: _nls_subproblem at line 137
Total time: 1.73153 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
137 def _nls_subproblem(V, W, H_init, tol, max_iter):
138 """Non-negative least square solver
...
170 """
171 48 5863 122.1 0.3 if (H_init < 0).any():
172 raise ValueError("Negative values in H_init passed to NLS solver.")
173
174 48 139 2.9 0.0 H = H_init
175 48 112141 2336.3 5.8 WtV = np.dot(W.T, V)
176 48 16144 336.3 0.8 WtW = np.dot(W.T, W)
177
178 # values justified in the paper
179 48 144 3.0 0.0 alpha = 1
180 48 113 2.4 0.0 beta = 0.1
181 638 1880 2.9 0.1 for n_iter in xrange(1, max_iter + 1):
182 638 195133 305.9 10.2 grad = np.dot(WtW, H) - WtV
183 638 495761 777.1 25.9 proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
184 638 2449 3.8 0.1 if proj_gradient < tol:
185 48 130 2.7 0.0 break
186
187 1474 4474 3.0 0.2 for inner_iter in xrange(1, 20):
188 1474 83833 56.9 4.4 Hn = H - alpha * grad
189 # Hn = np.where(Hn > 0, Hn, 0)
190 1474 194239 131.8 10.1 Hn = _pos(Hn)
191 1474 48858 33.1 2.5 d = Hn - H
192 1474 150407 102.0 7.8 gradd = np.sum(grad * d)
193 1474 515390 349.7 26.9 dQd = np.sum(np.dot(WtW, d) * d)
...
By looking at the top values of the ``% Time`` column it is really easy to
pin-point the most expensive expressions that would deserve additional care.
Memory usage profiling
======================
You can analyze in detail the memory usage of any Python code with the help of
`memory_profiler <https://pypi.python.org/pypi/memory_profiler>`_. First,
install the latest version::
$ pip install -U memory_profiler
Then, setup the magics in a manner similar to ``line_profiler``.
- **Under IPython <= 0.10**, edit ``~/.ipython/ipy_user_conf.py`` and
ensure the following lines are present::
import IPython.ipapi
ip = IPython.ipapi.get()
Towards the end of the file, define the ``%memit`` and ``%mprun`` magics::
import memory_profiler
ip.expose_magic('memit', memory_profiler.magic_memit)
ip.expose_magic('mprun', memory_profiler.magic_mprun)
- **Under IPython 0.11+**, first create a configuration profile::
$ ipython profile create
Then create a file named ``~/.ipython/extensions/memory_profiler_ext.py``
with the following content::
import memory_profiler
def load_ipython_extension(ip):
ip.define_magic('memit', memory_profiler.magic_memit)
ip.define_magic('mprun', memory_profiler.magic_mprun)
Then register it in ``~/.ipython/profile_default/ipython_config.py``::
c.TerminalIPythonApp.extensions = [
'memory_profiler_ext',
]
c.InteractiveShellApp.extensions = [
'memory_profiler_ext',
]
This will register the ``%memit`` and ``%mprun`` magic commands in the
IPython terminal application and the other frontends such as qtconsole and
notebook.
``%mprun`` is useful to examine, line-by-line, the memory usage of key
functions in your program. It is very similar to ``%lprun``, discussed in the
previous section. For example, from the ``memory_profiler`` ``examples``
directory::
In [1] from example import my_func
In [2] %mprun -f my_func my_func()
Filename: example.py
Line # Mem usage Increment Line Contents
==============================================
3 @profile
4 5.97 MB 0.00 MB def my_func():
5 13.61 MB 7.64 MB a = [1] * (10 ** 6)
6 166.20 MB 152.59 MB b = [2] * (2 * 10 ** 7)
7 13.61 MB -152.59 MB del b
8 13.61 MB 0.00 MB return a
Another useful magic that ``memory_profiler`` defines is ``%memit``, which is
analogous to ``%timeit``. It can be used as follows::
In [1]: import numpy as np
In [2]: %memit np.zeros(1e7)
maximum of 3: 76.402344 MB per loop
For more details, see the docstrings of the magics, using ``%memit?`` and
``%mprun?``.
Performance tips for the Cython developer
=========================================
If profiling of the Python code reveals that the Python interpreter
overhead is larger by one order of magnitude or more than the cost of the
actual numerical computation (e.g. ``for`` loops over vector components,
nested evaluation of conditional expression, scalar arithmetic...), it
is probably adequate to extract the hotspot portion of the code as a
standalone function in a ``.pyx`` file, add static type declarations and
then use Cython to generate a C program suitable to be compiled as a
Python extension module.
The official documentation available at http://docs.cython.org/ contains
a tutorial and reference guide for developing such a module. In the
following we will just highlight a couple of tricks that we found
important in practice on the existing cython codebase in the scikit-learn
project.
TODO: html report, type declarations, bound checks, division by zero checks,
memory alignment, direct blas calls...
- https://www.youtube.com/watch?v=gMvkiQ-gOW8
- http://conference.scipy.org/proceedings/SciPy2009/paper_1/
- http://conference.scipy.org/proceedings/SciPy2009/paper_2/
.. _profiling-compiled-extension:
Profiling compiled extensions
=============================
When working with compiled extensions (written in C/C++ with a wrapper or
directly as Cython extension), the default Python profiler is useless:
we need a dedicated tool to introspect what's happening inside the
compiled extension it-self.
Using yep and google-perftools
--------------------------------
Easy profiling without special compilation options use yep:
- https://pypi.python.org/pypi/yep
- http://fa.bianp.net/blog/2011/a-profiler-for-python-extensions
.. note::
google-perftools provides a nice 'line by line' report mode that
can be triggered with the ``--lines`` option. However this
does not seem to work correctly at the time of writing. This
issue can be tracked on the `project issue tracker
<https://github.com/gperftools/gperftools>`_.
Using gprof
-------------
In order to profile compiled Python extensions one could use ``gprof``
after having recompiled the project with ``gcc -pg`` and using the
``python-dbg`` variant of the interpreter on debian / ubuntu: however
this approach requires to also have ``numpy`` and ``scipy`` recompiled
with ``-pg`` which is rather complicated to get working.
Fortunately there exist two alternative profilers that don't require you to
recompile everything.
Using valgrind / callgrind / kcachegrind
----------------------------------------
TODO
Multi-core parallelism using ``joblib.Parallel``
================================================
TODO: give a simple teaser example here.
Checkout the official joblib documentation:
- https://pythonhosted.org/joblib
.. _warm-restarts:
A sample algorithmic trick: warm restarts for cross validation
==============================================================
TODO: demonstrate the warm restart tricks for cross validation of linear
regression with Coordinate Descent.