Skip to content

PERF, BENCH: Fix performance issue when indexing into non-unique DatetimeIndex/PeriodIndex. #27136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 6, 2019

Conversation

qwhelan
Copy link
Contributor

@qwhelan qwhelan commented Jun 30, 2019

This PR fixes a number of issues involving non-unique non-numeric indexing:

  • The NonNumericSeriesIndexing benchmark incorrectly assumes tm.makeStringIndex() returns a monotonic Index. It does not and unfortunately invalidates the existing results here.
    • While modifying this, figured should add PeriodIndex and non_monotonic cases
  • The setup() function creates an index but does not trigger the engine mapping to be populated. This leads to noisy and erroneous results.
  • A PeriodIndex or DatetimeIndex with non-unique, monotonic data creates an entirely new Index when being sliced. This means the engine mapping gets recomputed on every list-like query.
  • A PeriodIndex is actually a PeriodEngine plus a separate Int64Index (with associated Int64Engine). The former contains a mapping of {Period: value} while being a subclass of Int64Engine. This means calls to _bin_search() fail as the underlying array consists of Period objects. This only occurs when the index is non-unique, monotonic, and >= 1M elements long.
    • We add an explicit test for this failure mode
  • A number of index calls pass value.ordinal to the PeriodEngine; as that is a {Period: value} mapping, all such calls fail.
    • We instead pass value.ordinal queries to the Int64Engine, where they succeed.
  • Finally, PeriodIndex.get_memory() does not account for the memory of its associated Int64Index. This is fixed by simply adding it to the total.
       before           after         ratio
     [2811464a]       [a68caa2a]
     <import_bench~1>       <period_fix>
           failed         62.4±2μs      n/a  indexing.NonNumericSeriesIndexing.time_get_value('period', 'unique_monotonic_inc')
           failed         216±20μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('period', 'unique_monotonic_inc')
           failed      2.74±0.05ms      n/a  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'unique_monotonic_inc')
           failed        160±0.5μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('period', 'unique_monotonic_inc')
           failed       43.0±0.2μs      n/a  indexing.NonNumericSeriesIndexing.time_getitem_scalar('period', 'unique_monotonic_inc')
-        718±10μs         207±10μs     0.29  indexing.NonNumericSeriesIndexing.time_getitem_label_slice('period', 'non_monotonic')
-     2.53±0.05ms         701±10μs     0.28  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'non_monotonic')
-      33.8±0.9ms      2.32±0.04ms     0.07  indexing.NonNumericSeriesIndexing.time_getitem_list_like('period', 'nonunique_monotonic_inc')
-      62.7±0.5ms          532±3μs     0.01  indexing.NonNumericSeriesIndexing.time_getitem_list_like('datetime', 'nonunique_monotonic_inc')

       before           after         ratio
     [2811464a]       [a68caa2a]
     <import_bench~1>       <period_fix>
+      40.7±0.7μs       46.8±0.8μs     1.15  indexing.NonNumericSeriesIndexing.time_get_value('period', 'non_monotonic')
+        27.0±2μs       30.4±0.6μs     1.12  indexing.NonNumericSeriesIndexing.time_getitem_scalar('period', 'non_monotonic')
  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@pep8speaks
Copy link

pep8speaks commented Jun 30, 2019

Hello @qwhelan! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-06 21:25:48 UTC

@qwhelan qwhelan force-pushed the period_fix branch 2 times, most recently from a857b09 to 1881cb5 Compare June 30, 2019 00:18
@@ -906,6 +923,10 @@ def base(self):
FutureWarning, stacklevel=2)
return np.asarray(self._data)

def memory_usage(self, deep=False):
result = super(PeriodIndex, self).memory_usage(deep=deep)
result += self._int64index.memory_usage(deep=deep)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only should add the int64index if it’s actually created

Copy link
Contributor Author

@qwhelan qwhelan Jun 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for '_int64index' in self._cache now, which avoids triggering creation.

@qwhelan qwhelan force-pushed the period_fix branch 5 times, most recently from ea5d08d to e03dc41 Compare July 2, 2019 14:48
@jreback jreback added Performance Memory or execution speed performance Period Period data type Datetime Datetime data dtype labels Jul 4, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also rebase on master; note that you will need to update with black

black . will pretty much fix things
(pip install black)

@@ -576,6 +576,30 @@ def test_indexing_over_size_cutoff():
_index._SIZE_CUTOFF = old_cutoff


def test_indexing_over_size_cutoff_period_index():

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here (the title is pretty good though) with the issue number as a comment as well

@qwhelan qwhelan force-pushed the period_fix branch 3 times, most recently from 26b2c1e to 3809224 Compare July 6, 2019 04:10
qwhelan added 2 commits July 6, 2019 14:25
…timeIndex/PeriodIndex.

Additionally, fix asv benchmark bugs that hid the issue and correct PeriodIndex.memory_usage().
@jreback jreback added this to the 0.25.0 milestone Jul 6, 2019
@jreback
Copy link
Contributor

jreback commented Jul 6, 2019

lgtm. ping on green.

@jreback jreback merged commit af5b2a2 into pandas-dev:master Jul 6, 2019
@jreback
Copy link
Contributor

jreback commented Jul 6, 2019

thanks @qwhelan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants