
Performance issue with DataFrames with numpy fortran-order arrays and pd.concat #11958


Closed
jennolsen84 opened this issue Jan 5, 2016 · 20 comments
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question

Comments

@jennolsen84
Contributor

Hi,

When calling concat() on multiple large Fortran-ordered arrays, there is a big performance hit: most of the time is spent in ravel().

See:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4772

You can see that is_null(self) uses only a few values from the data after calling .ravel().

An easy fix is to change that line to values_flat = values.ravel(order='K')

Here is a link to numpy.ravel docs: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ravel.html

‘K’ means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, ‘C’ index order is used.
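The difference is easy to demonstrate outside pandas (a minimal sketch; the shape is arbitrary): on a Fortran-ordered array, ravel() with the default 'C' order must allocate a new buffer and copy, while order='K' can return a view.

```python
import numpy as np

# A Fortran-ordered array. The default ravel() reads in C (row-major)
# order, which does not match the memory layout, so NumPy must copy.
a = np.zeros((1000, 1000), order='F')

c_flat = a.ravel()           # default order='C': allocates a new buffer
k_flat = a.ravel(order='K')  # memory order: returns a view, no copy

print(c_flat.base is a)  # False -- c_flat owns a fresh copy
print(k_flat.base is a)  # True  -- k_flat is a view onto a
```

For a C-ordered input, both calls return a view, which is why the proposed change should not hurt the default case.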

@jreback
Contributor

jreback commented Jan 5, 2016

You have to work really hard to construct Fortran-ordered arrays with DataFrames; these are C-ordered by default.

If you optimize for a non-default layout, most other operations will be sub-par.

Why would you do this?

@jreback jreback added Performance Memory or execution speed performance Usage Question labels Jan 5, 2016
@jennolsen84
Contributor Author

First, thank you for not just closing with "wont-fix". I really appreciate that.

Here is a simple example, where pandas changes the ordering for us :(.

Also, order='K' should be fast for both C and Fortran ordered arrays. No?

import numpy as np
from pandas import *

data_panel = Panel(np.empty((100, 100, 100)), dtype=np.float32)
print(data_panel.values.flags)
print('----------------------')

with HDFStore('temp.h5', driver='H5FD_CORE') as f:
    f.put('df', data_panel)
    print(f.get('df').values.flags)

@jreback
Contributor

jreback commented Jan 5, 2016

You are not giving an example congruent to what you have above.

Secondly, you know that in-memory HDF5 is not terribly fast nor efficient, right?

@jennolsen84
Contributor Author

I am not sure what you mean. I found the ravel() call to be an issue after looking at a profiler. I was not trying to use Fortran-ordered arrays; they were unintentionally introduced by our usage of HDFStore.

Initially, I thought that pandas preferred Fortran ordered arrays, but after your comment, I suspect that the HDFStore behavior is not intentional.

So, we have two separate issues:

  1. ravel call is forcing C ordering, when it probably doesn't need to. The patch I proposed should fix it, without hurting current users.
  2. HDFStore is changing the ordering for our data (based on your comment "you have to work really hard to construct fortran ordered arrays with DataFrames, these are c-ordered by default")

I would be happy with a fix for (2). But in case we decide to use F-ordered arrays in the future, it would be nice to have order='K' passed into ravel when possible as well.

@jennolsen84
Contributor Author

HDF5 is used to serialize dataframes that can be sent across the wire, if you know of a better way please let me know. I left out the compression options, as compression is currently broken in pytables (and has been for a long time): PyTables/PyTables#461

@jreback
Contributor

jreback commented Jan 5, 2016

It's quite a lot faster to use msgpack or pickle for over-the-wire, actually. Nothing wrong with HDF5 for that, though.

My point above is that you are showing a Panel, which has slightly different semantics. Show the code that is doing the concatenation (and how you are generating the frames, or an example of that).

Ideally, give me a complete example of the perf issue you are seeing.
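The pickle route mentioned above can be sketched as follows (a minimal illustration; the DataFrame contents are arbitrary):

```python
import pickle

import numpy as np
import pandas as pd

# Round-trip a DataFrame through pickle for over-the-wire transfer.
# HIGHEST_PROTOCOL keeps the payload binary and compact.
df = pd.DataFrame(np.arange(12, dtype=np.float32).reshape(4, 3),
                  columns=list('abc'))
payload = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(restored.equals(df))  # True -- lossless round trip
```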

@jreback
Contributor

jreback commented Jan 5, 2016

You can also specify compression upfront, FYI (though YMMV as to what's faster). To be honest, I don't really use compression with HDF5; it's often much faster without, unless you have LOTS of data (many, many gigs), which is invariably on-disk.

@jennolsen84
Contributor Author

Oh, sorry -- I meant to create the bug report for pd.Panel.

Here is the code; it takes ~35 seconds with ravel() and ~3 with ravel(order='K'). If a C-ordered array is used, it is still fast (~2 seconds).

import time

import numpy as np
from pandas import *

# simulate HDFStore changing the data ordering by passing order='F'
dataset = np.zeros((10000, 200, 20), dtype=np.float32, order='F')

panels = [Panel(np.copy(dataset, order='F')) for x in range(20)]

s = time.time()
concat(panels, axis=1)
e = time.time()
print('concat took', e - s)

@jennolsen84
Contributor Author

We were passing compression options to HDFStore(), so that is broken as well. Does to_msgpack support compression? I think we looked at it before, and it was dropping the compression options. It was marked as EXPERIMENTAL (and still is), so we just stayed away. Our data is very compression-friendly, as there are a lot of duplicates, so we're hoping that (de)compression time + network transfer time is a lot less than the uncompressed transfer time. Max network speed is 1 Gbps. We could just compress the whole file afterwards using gzip/bz2/lzo, etc., but then we lose chunking (which we might want to use further down the road).

@jreback
Contributor

jreback commented Jan 5, 2016

In [28]: panel = np.zeros((10000, 200, 2), dtype=np.float32, order='F')

In [29]: panels_f = [ Panel(np.copy(panel, order='F')) for i in range(20) ]

In [30]: panels_c = [ Panel(np.copy(panel, order='C')) for i in range(20) ]

In [33]: frame = np.zeros((10000, 200), dtype=np.float32, order='F')

In [34]: frames_f = [ DataFrame(np.copy(frame, order='F')) for i in range(20) ]

In [35]: frames_c = [ DataFrame(np.copy(frame, order='C')) for i in range(20) ]

In [43]: %timeit pd.concat(panels_c,axis=0,ignore_index=True)
10 loops, best of 3: 138 ms per loop

In [40]: %timeit pd.concat(panels_f,axis=0,ignore_index=True)
1 loops, best of 3: 796 ms per loop

In [31]: %timeit pd.concat(panels_c,axis=1,ignore_index=True)
10 loops, best of 3: 197 ms per loop

In [32]: %timeit pd.concat(panels_f,axis=1,ignore_index=True)
1 loops, best of 3: 572 ms per loop

In [42]: %timeit pd.concat(panels_c,axis=2,ignore_index=True)
1 loops, best of 3: 850 ms per loop

In [41]: %timeit pd.concat(panels_f,axis=2,ignore_index=True)
1 loops, best of 3: 623 ms per loop

In [39]: %timeit pd.concat(frames_c,ignore_index=True,axis=0)
1 loops, best of 3: 236 ms per loop

In [38]: %timeit pd.concat(frames_f,ignore_index=True,axis=0)
10 loops, best of 3: 97.8 ms per loop

In [36]: %timeit pd.concat(frames_c,ignore_index=True,axis=1)
1 loops, best of 3: 261 ms per loop

In [37]: %timeit pd.concat(frames_f,ignore_index=True,axis=1)
10 loops, best of 3: 66.4 ms per loop

@jreback
Contributor

jreback commented Jan 5, 2016

@jennolsen84 If you'd like to change that line, see what effect it has on all of the benchmarks above, and add these to the asv suite, that would be greatly appreciated.

@jennolsen84
Contributor Author

Here is the effect. I will look into the asv suite.

All times in seconds

I was surprised to see a speedup in the case of (DataFrame, C-order, axis=0). I re-verified it for that case by backing out the change, and the speedup was still there.

| P or DF   | ordering | axis | before | after | % original time |
|-----------|----------|------|--------|-------|-----------------|
| Panel     | C        | 0    | 0.531  | 0.537 | 101.13%         |
| Panel     | F        | 0    | 1.34   | 0.884 | 65.97%          |
| Panel     | C        | 1    | 0.612  | 0.604 | 98.69%          |
| Panel     | F        | 1    | 1.01   | 0.571 | 56.53%          |
| Panel     | C        | 2    | 1.47   | 1.48  | 100.68%         |
| Panel     | F        | 2    | 0.968  | 0.55  | 56.82%          |
| DataFrame | C        | 0    | 0.425  | 0.286 | 67.29%          |
| DataFrame | F        | 0    | 0.286  | 0.276 | 96.50%          |
| DataFrame | C        | 1    | 0.508  | 0.365 | 71.85%          |
| DataFrame | F        | 1    | 0.244  | 0.246 | 100.82%        |

Other questions: there are a few other uses of ravel in core/internals.py. Some are simple to analyze (e.g. the ObjectBlock.is_bool property); others I am not sure about.

Also, should I open another issue for HDFStore() changing the data ordering to Fortran instead of C?

@jennolsen84
Contributor Author

I tried setting up asv, but I am not sure why it is failing. I am using Python 3.5, but it is trying to install Python 2.7 packages:

~/pandas/asv_bench$ time asv continuous master HEAD -b groupby.groupby_agg_builtins1
· Creating environments....
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Error running /home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/bin/python /home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py discover /home/jenn/pandas/asv_bench/benchmarks /tmp/tmp59jwybho
             STDOUT -------->

             STDERR -------->
             Traceback (most recent call last):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 786, in <module>
                 commands[mode](args)
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 731, in main_discover
                 list_benchmarks(benchmark_dir, fp)
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 716, in list_benchmarks
                 for benchmark in disc_benchmarks(root):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 701, in disc_benchmarks
                 for module in disc_files(root, os.path.basename(root) + '.'):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 690, in disc_files
                 module = import_module(package + filename)
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/importlib/__init__.py", line 37, in import_module
                 __import__(name)
               File "/home/jenn/pandas/asv_bench/benchmarks/packers.py", line 8, in <module>
                 from sqlalchemy import create_engine
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/__init__.py", line 9, in <module>
                 from .sql import (
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/__init__.py", line 8, in <module>
                 from .expression import (
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 30, in <module>
                 from .visitors import Visitable
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 28, in <module>
                 from .. import util
             ImportError: cannot import name util
Traceback (most recent call last):
  File "/home/jenn/miniconda3/envs/pandas_dev/bin/asv", line 9, in <module>
    load_entry_point('asv==0.2.dev857+17b70f9a', 'console_scripts', 'asv')()
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/main.py", line 36, in main
    result = args.func(args)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/__init__.py", line 48, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/continuous.py", line 49, in run_from_conf_args
    **kwargs
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/continuous.py", line 73, in run
    _machine_file=_machine_file)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/run.py", line 198, in run
    benchmarks = Benchmarks(conf, regex=bench)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmarks.py", line 289, in __init__
    benchmarks = self.disc_benchmarks(conf)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmarks.py", line 339, in disc_benchmarks
    dots=False)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/plugins/conda.py", line 120, in run
    return self.run_executable('python', args, **kwargs)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/environment.py", line 489, in run_executable
    return util.check_output([exe] + args, **kwargs)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/util.py", line 497, in check_output
    raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command '/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/bin/python /home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py discover /home/jenn/pandas/asv_bench/benchmarks /tmp/tmp59jwybho' returned non-zero exit status 1

real    2m24.706s
user    2m20.776s
sys 0m4.700s

@TomAugspurger
Contributor

Are you using conda or virtualenv for handling virtual environments? You can edit asv_bench/asv.conf.json to use either, and change the Python version to 3.5.

Side note: should we include a default-asv.conf.json in the source repo, and have users copy it to asv.conf.json and edit it as needed?
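For reference, the relevant keys in asv_bench/asv.conf.json would look something like this (a sketch; the real file contains many other keys, which are omitted here -- asv's config loader accepts // comments):

```json
{
    // asv supports "conda" or "virtualenv" here
    "environment_type": "conda",

    // benchmark against Python 3.5 instead of the default 2.7
    "pythons": ["3.5"]
}
```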

@jreback
Contributor

jreback commented Jan 5, 2016

I thought the config is checked in?

@TomAugspurger
Contributor

Yeah the config is, but if you make any changes to it you need to exclude it from your commit. If we rename it and have people copy it to where asv expects it, you won't have any issues. Either way, not a big deal.

@jennolsen84
Contributor Author

I ran the asv benchmarks twice. First run (not sure what happened, but everything looks horrible):

    before     after       ratio
  [1dc78c71] [db58191a]
+   41.53ms      3.06s     73.66  join_merge.concat_panels.time_concat_c_ordered_axis0
+   65.91ms      1.03s     15.70  join_merge.concat_panels.time_concat_c_ordered_axis1
+  114.20ms      1.27s     11.16  join_merge.concat_dataframes.time_concat_c_ordered_axis0
+  327.63ms      2.02s      6.16  join_merge.concat_panels.time_concat_f_ordered_axis1
+  324.52ms      1.68s      5.18  join_merge.concat_panels.time_concat_f_ordered_axis2
+     1.18s      3.33s      2.82  join_merge.concat_panels.time_concat_c_ordered_axis2
+     1.02s      2.39s      2.35  join_merge.concat_panels.time_concat_f_ordered_axis0
-  186.77ms    93.05ms      0.50  join_merge.concat_dataframes.time_concat_c_ordered_axis1

The second time, things were a lot better:

    before     after       ratio
  [1dc78c71] [db58191a]
-  531.24ms   252.25ms      0.47  join_merge.concat_panels.time_concat_f_ordered_axis0
-  111.06ms    18.93ms      0.17  join_merge.concat_dataframes.time_concat_c_ordered_axis0
-  326.81ms    36.03ms      0.11  join_merge.concat_panels.time_concat_f_ordered_axis2
-  372.73ms    35.83ms      0.10  join_merge.concat_panels.time_concat_f_ordered_axis1
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Should I submit a PR?

@jreback
Contributor

jreback commented Jan 5, 2016

that would be great!

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 6, 2016
@jreback jreback added this to the Next Major Release milestone Jan 6, 2016
@jreback
Contributor

jreback commented Jan 6, 2016

@jennolsen84 Yes, please open an issue for the Panel returning F-ordered data from HDFStore.

We do not guarantee the ordering of the data, but we will have a look.

@jreback
Contributor

jreback commented Jan 8, 2016

closed by #11967

@jreback jreback closed this as completed Jan 8, 2016