Performance issue with DataFrames with numpy fortran-order arrays and pd.concat #11958
You have to work really hard to construct Fortran-ordered arrays with DataFrames; these are C-ordered by default. If you optimize for a non-default layout, you will make most other operations sub-par. Why would you do this?
First, thank you for not just closing this with "wont-fix"; I really appreciate that. Here is a simple example where pandas changes the ordering for us :(. Also, order='K' should be fast for both C- and Fortran-ordered arrays, no?
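The "order='K' should be fast for both" claim can be checked directly with NumPy: the default ravel() must copy a Fortran-ordered array to produce C-ordered output, while order='K' reads memory as-is and returns a view for either layout. A minimal sketch (array sizes are arbitrary; np.shares_memory requires NumPy >= 1.11):

```python
import numpy as np

# Illustrative shapes only.
a_c = np.ones((1000, 1000), order='C')
a_f = np.ones((1000, 1000), order='F')

# Default order='C' must reorder a Fortran array, forcing a copy.
flat_default = a_f.ravel()
# order='K' flattens in memory order, so no copy is needed.
flat_k = a_f.ravel(order='K')

print(np.shares_memory(a_f, flat_default))          # copy was made
print(np.shares_memory(a_f, flat_k))                # view, no copy
print(np.shares_memory(a_c, a_c.ravel(order='K')))  # view for C arrays too
```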
You are not giving an example congruent to the one above. Secondly, you know that in-memory HDF5 is not terribly fast nor efficient, right?
I am not sure what you mean. I found the ravel() call to be an issue after looking at a profiler. I was not trying to use Fortran-ordered arrays; they were unintentionally introduced by our usage of HDFStore. Initially, I thought that pandas preferred Fortran-ordered arrays, but after your comment, I suspect that the HDFStore behavior is not intentional. So, we have two separate issues:
I would be happy with a fix for #2. But in case we decide to use F-ordered arrays in the future, it would also be nice to have order='K' passed into ravel() when possible.
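The scenario being discussed, concatenating DataFrames whose backing arrays are Fortran-ordered, can be reproduced without HDFStore by constructing the frames from np.asfortranarray. A small sketch (shapes and values are made up; only the internal ravel() cost differs between layouts, not the result):

```python
import numpy as np
import pandas as pd

# Frames backed by Fortran-ordered arrays, mimicking what an
# HDFStore round-trip can hand back.
arr = np.asfortranarray(np.arange(12.0).reshape(3, 4))
df1 = pd.DataFrame(arr)
df2 = pd.DataFrame(arr.copy(order='F'))

out = pd.concat([df1, df2], axis=0, ignore_index=True)

# The concatenated values match a plain NumPy vstack either way.
expected = np.vstack([arr, arr])
print((out.to_numpy() == expected).all())
```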
HDF5 is used to serialize DataFrames that can be sent across the wire; if you know of a better way, please let me know. I left out the compression options, as compression is currently broken in PyTables (and has been for a long time): PyTables/PyTables#461
It is quite a lot faster to use […]. My point above is that you are showing […]. Ideally, give me a complete example of the perf issue you are seeing.
You can also specify compression upfront, FYI (though YMMV as to what's faster). To be honest, I don't really use compression with HDF5; it's often much faster without. Unless you have LOTS of data (many, many gigs), which is invariably on-disk.
Oh, sorry -- I meant to create the bug report for pd.Panel. Here is the code; this takes ~35 seconds with […]
We were passing compression options to HDFStore(), so that is broken as well. Does to_msgpack support compression? I think we looked at it before, and it was dropping the compression options. It was marked as EXPERIMENTAL (and still is), so we just stayed away. Our data is very compression-friendly, as there are a lot of duplicates, so we're hoping that (de)compression time + network transfer time is a lot less than the uncompressed transfer time. Max network speed is 1 Gbps. We could just compress the whole file afterwards using gzip/bz2/lzo, etc., but then we lose chunking (which we might want to use further down the road).
@jennolsen84 so if you'd like to change that line and see what effect this has on all of the benchmarks above (and add these to the asv suite), it would be greatly appreciated.
Here is the effect; I will look into the asv suite. All times are in seconds. I was surprised to see a speedup in the (DataFrame, C-order, axis=0) case. I re-verified that case by backing out the change, and the speedup was still there.
Other questions: there are a few other uses of ravel() in core/internals.py. Some are simple to analyze (e.g. […]). Also, should I open another issue for […]?
Tried setting up asv, but it is failing and I am not sure why: I am using Python 3.5, yet it is trying to install Python 2.7 packages.
Are you using conda or virtualenv for handling virtual environments? You can edit […]. Side note: should we include a default […]?
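For reference, the interpreter versions asv builds environments for come from the "pythons" list in asv.conf.json, so the Python 2.7 packages above may simply be what the checked-in config requests. A partial fragment (the surrounding keys are omitted; pinning the version is a guess at the fix):

```json
{
    "pythons": ["3.5"]
}
```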
I thought the config was checked in?
Yeah, the config is, but if you make any changes to it you need to exclude them from your commit. If we rename it and have people copy it to where […]
I ran the asv benchmarks twice. The first time, I am not sure what happened, but everything was horrible.
The second time, things were a lot better:
Should I submit a PR?
That would be great!
@jennolsen84 yes, please open an issue for the […]. We are not guaranteeing the order of the data, but we will have a look.
closed by #11967 |
Hi,
When trying to concat() multiple big Fortran-order arrays, there is a big performance hit, as most of the work goes into calling ravel().
See:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4772
You can see that is_null(self) is using just a few values from the data after calling .ravel().
An easy fix is to change that line to:
values_flat = values.ravel(order='K')
Here is a link to numpy.ravel docs: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ravel.html
‘K’ means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, ‘C’ index order is used.
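The cost of the default ravel() on a Fortran-ordered array can be seen with a rough timeit micro-benchmark; the array size below is arbitrary, and absolute numbers will vary by machine:

```python
import timeit
import numpy as np

# A large Fortran-ordered array, standing in for a block's values.
f_arr = np.ones((2000, 2000), order='F')

# Default order copies ~32 MB per call; order='K' just makes a view.
t_c = timeit.timeit(lambda: f_arr.ravel(), number=20)
t_k = timeit.timeit(lambda: f_arr.ravel(order='K'), number=20)

print(f"ravel():          {t_c:.4f}s")
print(f"ravel(order='K'): {t_k:.4f}s")
```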