Skip to content

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Sep 11, 2022

Perf improvement when merging on a sorted MultiIndex. The improvement comes from avoiding MultiIndex._values.

Overall, seems to generalize across dtypes and merge types quite well. The ASVs show one slowdown when doing an outer merge with a datetime64 index. In that case the time is spent within MultiIndex._union. Coincidently, @phofl just opened #48505 which shows a nice improvement for datetimes in MultiIndex._union so that slowdown might well go away if #48505 is merged.

ASVs added:

       before           after         ratio
     [fe9e5d02]       [367fbd89]
     <main>           <perf-merge-with-mulitindexes>
+         259±3ms          383±2ms     1.48  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'outer')
-         101±3ms         85.1±1ms     0.84  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'inner')
-        51.0±1ms       40.6±0.4ms     0.80  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'right')
-      50.6±0.5ms       39.3±0.3ms     0.78  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'left')
-         185±3ms        138±0.4ms     0.75  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'inner')
-        186±20ms          137±1ms     0.73  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'outer')
-      83.0±0.7ms       46.6±0.8ms     0.56  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'inner')
-      36.9±0.3ms       11.4±0.1ms     0.31  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'left')
-        37.6±1ms       11.5±0.2ms     0.31  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'right')
-         146±3ms       14.0±0.1ms     0.10  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'left')
-         146±3ms       11.7±0.4ms     0.08  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'right')

[update]: updated times following the merge of #48505:

       before           after         ratio
     [ac648eea]       [666d3990]
     <perf-merge-with-mulitindexes^2>       <perf-merge-with-mulitindexes>
-       103±0.4ms         85.2±2ms     0.83  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'inner')
-         212±1ms          165±2ms     0.78  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'outer')
-      53.4±0.9ms       41.2±0.5ms     0.77  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'right')
-      52.4±0.4ms       40.1±0.7ms     0.76  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('Int64', 'Int64'), 'left')
-         186±2ms        139±0.7ms     0.75  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'inner')
-         265±3ms        184±0.8ms     0.69  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'outer')
-      85.5±0.4ms         48.3±2ms     0.57  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'inner')
-         196±3ms       78.9±0.5ms     0.40  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'outer')
-      40.7±0.4ms         14.0±3ms     0.34  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'right')
-        40.6±3ms       12.1±0.5ms     0.30  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('int64', 'int64'), 'left')
-       150±0.6ms       12.7±0.2ms     0.09  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'right')
-         153±5ms       12.2±0.3ms     0.08  join_merge.MergeMultiIndex.time_merge_sorted_multiindex(('datetime64[ns]', 'int64'), 'left')

@lukemanley lukemanley added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode MultiIndex labels Sep 11, 2022
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is faster in all possible cases? Meaning different dtypes

@lukemanley
Copy link
Member Author

I just expanded the ASVs to cover additional dtypes and merge types. See the timings updated timings and commentary re: MultiIndex._union.

@lukemanley
Copy link
Member Author

@phofl - see updated times in the summary

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mroeschke mroeschke added this to the 1.6 milestone Sep 19, 2022
@mroeschke mroeschke merged commit 81b5f1d into pandas-dev:main Sep 19, 2022
@mroeschke
Copy link
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the perf-merge-with-mulitindexes branch September 24, 2022 00:48
@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* merge on sorted multiindex performance

* whatsnew

* faster asv

* additional asv cases

* avoid going through multi._values
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
MultiIndex Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants