
Performance issue with DataFrames with numpy fortran-order arrays and pd.concat #11958


Closed
jennolsen84 opened this issue Jan 5, 2016 · 20 comments
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question

Comments

@jennolsen84
Contributor

Hi,

When calling concat() on multiple large Fortran-ordered arrays, there is a big performance hit: most of the time is spent in ravel().

See:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4772

You can see that is_null(self) uses only a few values from the data after calling .ravel().

An easy fix is to change that line to values_flat = values.ravel(order='K')

Here is a link to numpy.ravel docs: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ravel.html

‘K’ means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, ‘C’ index order is used.
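The difference is easy to demonstrate outside pandas (a minimal sketch; the shape is arbitrary): on a Fortran-ordered array, ravel() with the default 'C' order must allocate a new buffer and copy, while order='K' can return a view.

```python
import numpy as np

# A Fortran-ordered array. The default ravel() reads in C (row-major)
# order, which does not match the memory layout, so NumPy must copy.
a = np.zeros((1000, 1000), order='F')

c_flat = a.ravel()           # default order='C': allocates a new buffer
k_flat = a.ravel(order='K')  # memory order: returns a view, no copy

print(c_flat.base is a)  # False -- c_flat owns a fresh copy
print(k_flat.base is a)  # True  -- k_flat is a view onto a
```

For a C-ordered input, both calls return a view, which is why the proposed change should not hurt the default case.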

@jreback
Contributor

jreback commented Jan 5, 2016

You have to work really hard to construct Fortran-ordered arrays with DataFrames; these are C-ordered by default.

If you optimize for a non-default layout, most other operations will be sub-par.

Why would you do this?

@jreback jreback added Performance Memory or execution speed performance Usage Question labels Jan 5, 2016
@jennolsen84
Contributor Author

First, thank you for not just closing with "wont-fix". I really appreciate that.

Here is a simple example, where pandas changes the ordering for us :(.

Also, order='K' should be fast for both C and Fortran ordered arrays. No?

import numpy as np
from pandas import *

data_panel = Panel(np.empty((100, 100, 100)), dtype=np.float32)
print(data_panel.values.flags)
print('----------------------')

with HDFStore('temp.h5', driver='H5FD_CORE') as f:
    f.put('df', data_panel)
    print(f.get('df').values.flags)

@jreback
Contributor

jreback commented Jan 5, 2016

You are not giving an example congruent to what you have above.

Secondly, you know that in-memory HDF5 is not terribly fast nor efficient, right?

@jennolsen84
Contributor Author

I am not sure what you mean. I found the ravel() call to be an issue after looking at a profiler. I was not trying to use Fortran-ordered arrays; they were unintentionally introduced by our usage of HDFStore.

Initially, I thought that pandas preferred Fortran ordered arrays, but after your comment, I suspect that the HDFStore behavior is not intentional.

So, we have two separate issues:

  1. ravel call is forcing C ordering, when it probably doesn't need to. The patch I proposed should fix it, without hurting current users.
  2. HDFStore is changing the ordering for our data (based on your comment "you have to work really hard to construct fortran ordered arrays with DataFrames, these are c-ordered by default")

I would be happy with a fix for (2). But in case we decide to use F-ordered arrays in the future, it would be nice to have order='K' passed into ravel when possible as well.

@jennolsen84
Contributor Author

HDF5 is used to serialize dataframes that can be sent across the wire, if you know of a better way please let me know. I left out the compression options, as compression is currently broken in pytables (and has been for a long time): PyTables/PyTables#461

@jreback
Contributor

jreback commented Jan 5, 2016

It's quite a lot faster to use msgpack or pickle for over-the-wire, actually. Nothing wrong with HDF5 for that, though.

My point above is that you are showing a Panel, which has slightly different semantics. Show the code that is doing the concatenation (and how you are generating the frames, or an example of that).

Ideally, give me a complete example of the perf issue you are seeing.
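The pickle route mentioned above can be sketched as follows (a minimal illustration; the DataFrame contents are arbitrary):

```python
import pickle

import numpy as np
import pandas as pd

# Round-trip a DataFrame through pickle for over-the-wire transfer.
# HIGHEST_PROTOCOL keeps the payload binary and compact.
df = pd.DataFrame(np.arange(12, dtype=np.float32).reshape(4, 3),
                  columns=list('abc'))
payload = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(restored.equals(df))  # True -- lossless round trip
```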

@jreback
Contributor

jreback commented Jan 5, 2016

You can also specify compression upfront, FYI (though YMMV as to what's faster). To be honest, I don't really use compression with HDF5; it's often much faster without, unless you have LOTS of data (many, many gigs), which is invariably on-disk.

@jennolsen84
Contributor Author

Oh, sorry -- I meant to create the bug report for pd.Panel.

Here is the code; it takes ~35 seconds with ravel() and ~3 with ravel(order='K'). If a C-ordered array is used, it is still fast (~2 seconds).

import time

import numpy as np
from pandas import *

# simulate HDFStore changing the data ordering by passing order='F'
dataset = np.zeros((10000, 200, 20), dtype=np.float32, order='F')

panels = [Panel(np.copy(dataset, order='F')) for x in range(20)]

s = time.time()
concat(panels, axis=1)
e = time.time()
print('concat took', e - s)

@jennolsen84
Contributor Author

We were passing compression options to HDFStore(), so that is broken as well. Does to_msgpack support compression? I think we looked at it before, and it was dropping the compression options. It was marked as EXPERIMENTAL (and still is), so we just stayed away. Our data is very compression-friendly, as there are a lot of duplicates, so we're hoping that (de)compression time + network transfer time is a lot less than the uncompressed transfer time. Max network speed is 1 Gbps. We could just compress the whole file afterwards using gzip/bz2/lzo, etc., but then we lose chunking (which we might want to use further down the road).

@jreback
Contributor

jreback commented Jan 5, 2016

In [28]: panel = np.zeros((10000, 200, 2), dtype=np.float32, order='F')

In [29]: panels_f = [ Panel(np.copy(panel, order='F')) for i in range(20) ]

In [30]: panels_c = [ Panel(np.copy(panel, order='C')) for i in range(20) ]

In [33]: frame = np.zeros((10000, 200), dtype=np.float32, order='F')

In [34]: frames_f = [ DataFrame(np.copy(frame, order='F')) for i in range(20) ]

In [35]: frames_c = [ DataFrame(np.copy(frame, order='C')) for i in range(20) ]

In [43]: %timeit pd.concat(panels_c,axis=0,ignore_index=True)
10 loops, best of 3: 138 ms per loop

In [40]: %timeit pd.concat(panels_f,axis=0,ignore_index=True)
1 loops, best of 3: 796 ms per loop

In [31]: %timeit pd.concat(panels_c,axis=1,ignore_index=True)
10 loops, best of 3: 197 ms per loop

In [32]: %timeit pd.concat(panels_f,axis=1,ignore_index=True)
1 loops, best of 3: 572 ms per loop

In [42]: %timeit pd.concat(panels_c,axis=2,ignore_index=True)
1 loops, best of 3: 850 ms per loop

In [41]: %timeit pd.concat(panels_f,axis=2,ignore_index=True)
1 loops, best of 3: 623 ms per loop

In [39]: %timeit pd.concat(frames_c,ignore_index=True,axis=0)
1 loops, best of 3: 236 ms per loop

In [38]: %timeit pd.concat(frames_f,ignore_index=True,axis=0)
10 loops, best of 3: 97.8 ms per loop

In [36]: %timeit pd.concat(frames_c,ignore_index=True,axis=1)
1 loops, best of 3: 261 ms per loop

In [37]: %timeit pd.concat(frames_f,ignore_index=True,axis=1)
10 loops, best of 3: 66.4 ms per loop

@jreback
Contributor

jreback commented Jan 5, 2016

@jennolsen84 If you'd like to change that line, see what effect it has on all of the benchmarks above, and add these to the asv suite, that would be greatly appreciated.

@jennolsen84
Contributor Author

Here is the effect. I will look into the asv suite.

All times in seconds

I was surprised to see a speedup in the case of (DataFrame, C-order, axis=0). I re-verified it for that case by backing out the change, and the speedup was still there.

| P or DF   | ordering | axis | before | after | % original time |
|-----------|----------|------|--------|-------|-----------------|
| Panel     | C        | 0    | 0.531  | 0.537 | 101.13%         |
| Panel     | F        | 0    | 1.34   | 0.884 | 65.97%          |
| Panel     | C        | 1    | 0.612  | 0.604 | 98.69%          |
| Panel     | F        | 1    | 1.01   | 0.571 | 56.53%          |
| Panel     | C        | 2    | 1.47   | 1.48  | 100.68%         |
| Panel     | F        | 2    | 0.968  | 0.55  | 56.82%          |
| DataFrame | C        | 0    | 0.425  | 0.286 | 67.29%          |
| DataFrame | F        | 0    | 0.286  | 0.276 | 96.50%          |
| DataFrame | C        | 1    | 0.508  | 0.365 | 71.85%          |
| DataFrame | F        | 1    | 0.244  | 0.246 | 100.82%        |

Other questions: there are a few other uses of ravel in core/internals.py. Some are simple to analyze (e.g. the ObjectBlock.is_bool property); others I am not sure about.

Also, should I open another issue for HDFStore() changing the data ordering to Fortran instead of C?

@jennolsen84
Contributor Author

I tried setting up asv, but I am not sure why it is failing. I am using Python 3.5, but it is trying to install Python 2.7 packages:

~/pandas/asv_bench$ time asv continuous master HEAD -b groupby.groupby_agg_builtins1
· Creating environments....
· Discovering benchmarks
·· Uninstalling from py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building for py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
·· Installing into py2.7-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
·· Error running /home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/bin/python /home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py discover /home/jenn/pandas/asv_bench/benchmarks /tmp/tmp59jwybho
             STDOUT -------->

             STDERR -------->
             Traceback (most recent call last):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 786, in <module>
                 commands[mode](args)
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 731, in main_discover
                 list_benchmarks(benchmark_dir, fp)
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 716, in list_benchmarks
                 for benchmark in disc_benchmarks(root):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 701, in disc_benchmarks
                 for module in disc_files(root, os.path.basename(root) + '.'):
               File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py", line 690, in disc_files
                 module = import_module(package + filename)
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/importlib/__init__.py", line 37, in import_module
                 __import__(name)
               File "/home/jenn/pandas/asv_bench/benchmarks/packers.py", line 8, in <module>
                 from sqlalchemy import create_engine
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/__init__.py", line 9, in <module>
                 from .sql import (
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/__init__.py", line 8, in <module>
                 from .expression import (
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 30, in <module>
                 from .visitors import Visitable
               File "/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 28, in <module>
                 from .. import util
             ImportError: cannot import name util
Traceback (most recent call last):
  File "/home/jenn/miniconda3/envs/pandas_dev/bin/asv", line 9, in <module>
    load_entry_point('asv==0.2.dev857+17b70f9a', 'console_scripts', 'asv')()
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/main.py", line 36, in main
    result = args.func(args)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/__init__.py", line 48, in run_from_args
    return cls.run_from_conf_args(conf, args)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/continuous.py", line 49, in run_from_conf_args
    **kwargs
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/continuous.py", line 73, in run
    _machine_file=_machine_file)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/commands/run.py", line 198, in run
    benchmarks = Benchmarks(conf, regex=bench)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmarks.py", line 289, in __init__
    benchmarks = self.disc_benchmarks(conf)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmarks.py", line 339, in disc_benchmarks
    dots=False)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/plugins/conda.py", line 120, in run
    return self.run_executable('python', args, **kwargs)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/environment.py", line 489, in run_executable
    return util.check_output([exe] + args, **kwargs)
  File "/home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/util.py", line 497, in check_output
    raise ProcessError(args, retcode, stdout, stderr)
asv.util.ProcessError: Command '/home/jenn/pandas/asv_bench/env/916946cc02b9fc5a85fec865f4ddfb9d/bin/python /home/jenn/miniconda3/envs/pandas_dev/lib/python3.5/site-packages/asv/benchmark.py discover /home/jenn/pandas/asv_bench/benchmarks /tmp/tmp59jwybho' returned non-zero exit status 1

real    2m24.706s
user    2m20.776s
sys 0m4.700s

@TomAugspurger
Contributor

Are you using conda or virtualenv for handling virtual environments? You can edit asv_bench/asv.conf.json to use either, and change the Python version to 3.5.

Side note: should we include a default-asv.conf.json in the source repo, and have users copy it to asv.conf.json and edit it as needed?
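For reference, the relevant keys in asv_bench/asv.conf.json would look something like this (a sketch; the real file contains many other keys, which are omitted here -- asv's config loader accepts // comments):

```json
{
    // asv supports "conda" or "virtualenv" here
    "environment_type": "conda",

    // benchmark against Python 3.5 instead of the default 2.7
    "pythons": ["3.5"]
}
```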

@jreback
Contributor

jreback commented Jan 5, 2016

I thought the config is checked in?

@TomAugspurger
Contributor

Yeah the config is, but if you make any changes to it you need to exclude it from your commit. If we rename it and have people copy it to where asv expects it, you won't have any issues. Either way, not a big deal.

@jennolsen84
Contributor Author

I ran the asv benchmarks twice. First run (not sure what happened, but everything looks horrible):

    before     after       ratio
  [1dc78c71] [db58191a]
+   41.53ms      3.06s     73.66  join_merge.concat_panels.time_concat_c_ordered_axis0
+   65.91ms      1.03s     15.70  join_merge.concat_panels.time_concat_c_ordered_axis1
+  114.20ms      1.27s     11.16  join_merge.concat_dataframes.time_concat_c_ordered_axis0
+  327.63ms      2.02s      6.16  join_merge.concat_panels.time_concat_f_ordered_axis1
+  324.52ms      1.68s      5.18  join_merge.concat_panels.time_concat_f_ordered_axis2
+     1.18s      3.33s      2.82  join_merge.concat_panels.time_concat_c_ordered_axis2
+     1.02s      2.39s      2.35  join_merge.concat_panels.time_concat_f_ordered_axis0
-  186.77ms    93.05ms      0.50  join_merge.concat_dataframes.time_concat_c_ordered_axis1

The second time, things were a lot better:

    before     after       ratio
  [1dc78c71] [db58191a]
-  531.24ms   252.25ms      0.47  join_merge.concat_panels.time_concat_f_ordered_axis0
-  111.06ms    18.93ms      0.17  join_merge.concat_dataframes.time_concat_c_ordered_axis0
-  326.81ms    36.03ms      0.11  join_merge.concat_panels.time_concat_f_ordered_axis2
-  372.73ms    35.83ms      0.10  join_merge.concat_panels.time_concat_f_ordered_axis1
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Should I submit a PR?

@jreback
Contributor

jreback commented Jan 5, 2016

that would be great!

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jan 6, 2016
@jreback jreback added this to the Next Major Release milestone Jan 6, 2016
@jreback
Contributor

jreback commented Jan 6, 2016

@jennolsen84 Yes, please open an issue for the Panel returning F-ordered data from HDFStore.

We do not guarantee the ordering of the data, but we will have a look.

@jreback
Contributor

jreback commented Jan 8, 2016

closed by #11967

@jreback jreback closed this as completed Jan 8, 2016