BUG: KDTree balanced_tree is unbalanced for degenerate data #14355
Conversation
The point (2.1, 2.9) is equidistant from the points (2, 2) and (3, 3), making the k-neighbours query sensitive to the exact tree structure.
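For concreteness, here is a small check (my own illustration with made-up extra points, not code from the PR) that (2.1, 2.9) is exactly equidistant from (2, 2) and (3, 3), so a k=1 query result depends on which tied neighbour the traversal reaches first:

```python
import numpy as np
from scipy.spatial import KDTree

query = np.array([2.1, 2.9])
print(np.linalg.norm(query - [2, 2]), np.linalg.norm(query - [3, 3]))
# both distances are sqrt(0.82) ~ 0.9055, an exact tie

# With a tie, which index a k=1 query returns depends on the tree layout.
pts = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [5.0, 5.0]])  # made-up data
tree = KDTree(pts)
d, i = tree.query(query, k=1)
print(d, i)  # d ~ 0.9055; i is 0 or 1 depending on how the tree was built
```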
I think this PR is logically incorrect. The way our kd-tree works, we cannot have the same value on both sides of the pivot (the splitting plane), which can happen in the case of ties. Our kd-tree cannot be exactly balanced in the presence of ties. This is why we do data swapping after the partial sorting, when in theory all the swapping (partial sorting) should be completed. And then it follows that we need to consider sliding the splitting plane even when using the median as pivot. Otherwise we need to update all query methods to assume we can have equal values on both sides of a splitting plane. IIRC the kd-tree in sklearn does this, whereas ours does not, because it started out using the sliding midpoint and added the median later on (mostly because it gave faster queries). I know this is annoying, but I think the build method needs to be the way it is in order to be logically correct given what the query methods assume.
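To make the tie problem concrete, here is a minimal NumPy sketch (not SciPy's build code) of the two options: keeping all equal values on one side of the pivot, which cannot be exactly balanced, versus splitting the index array exactly in half, which places the same value on both sides of the splitting plane:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
mid = len(x) // 2
pivot = np.partition(x, mid)[mid]        # median value, 2.0

# Option 1: keep ties together (what the query methods assume).
left, right = x[x < pivot], x[x >= pivot]
print(left, right)                       # [1.] [2. 2. 2. 3.] -- unbalanced

# Option 2: split the index array exactly in half.
order = np.argsort(x)
print(x[order[:mid]], x[order[mid:]])    # [1. 2.] [2. 2. 3.] -- 2.0 on both sides
```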
If this is true, then the sliding midpoint rule is broken as well. Consider constructing a node where all children are equal, but the tree is built with
If you agree with that, I should be able to fix both issues.
It was constructed on the promise that
Assuming this is an issue, the chances of it adversely affecting the result of a query are astronomically small. You would need to have the query radius exactly overlap with the point on the split line, in which case the result is borderline anyway and you would expect query results to be unstable just due to machine rounding.
So, this is more of a pedantically correct thing than a world-breaking bug.
scipy/spatial/kdtree.py
Outdated
>>> print(dd, ii)
[2. 0.14142136] [ 0 13]
>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=1)
>>> print(dd, ii, sep='\\n')
Is it more conventional to use r'\n'?
scipy/spatial/tests/test_kdtree.py
Outdated
@@ -1254,16 +1254,15 @@ def test_kdtree_duplicated_inputs(kdtree_type):
# it shall not divide more than 3 nodes.
# root left (1), and right (2)
kdtree = kdtree_type(data, leafsize=1)
What about adding a query test here? Is the new balanced tree build covered by a test?
I've fixed the un-compacted tree to use strict less-than, which means this tree always generates exactly 3 leaves. So I can just assert that all the indices of the ones are in the first leaf and all the twos in the second.
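For reference, a sketch of the kind of check described above, using cKDTree's node introspection; the duplicated-data layout and the expected three-node structure come from the test being discussed, while the attribute walk and the size n are my own illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

# n copies of (1, 1) followed by n copies of (2, 2), as in the duplicated-inputs test.
n = 1024
data = np.concatenate([np.ones((n, 2)), np.full((n, 2), 2.0)])

# Un-compacted balanced build; with strict less-than the tree should be just a
# root with two leaves, since the duplicates cannot be split any further.
kdtree = cKDTree(data, leafsize=1, compact_nodes=False, balanced_tree=True)

root = kdtree.tree                      # cKDTreeNode view of the root
left, right = root.lesser, root.greater
print(left.children, right.children)    # expected: n and n
print(np.unique(data[left.indices], axis=0))   # expected: only [1., 1.]
print(np.unique(data[right.indices], axis=0))  # expected: only [2., 2.]
```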
13aa953 to 5554d4c
Is this good to merge now? I think it is.
Yes, this should be ready to go.
Looks like the kdtree experts are +1 to merge, CI is green, and I don't see anything too suspicious from a look through myself.
Thanks guys!
Reference issue
Fixes gh-14074
What does this implement/fix?
The balanced_tree argument is meant to create a more balanced tree by splitting at the median instead of the midpoint of the known bounds. However, the "sliding" part of the sliding midpoint rule was also being applied after choosing the median, which results in an unbalanced split for degenerate data. Instead, keeping the split at the exact median results in a balanced tree.
Additional information
The comments state that the median rule comes from sklearn, and this change is equivalent to the sklearn KDTree code:
https://github.com/scikit-learn/scikit-learn/blob/3ae7c7615343bbd36acece57825d8b0d70fd9da4/sklearn/neighbors/_binary_tree.pxi#L1176-L1181
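As a rough schematic of the difference (my own illustration, not the SciPy or scikit-learn source): splitting on the middle index keeps the two halves balanced even for degenerate data, whereas a plane that slides off the tied median value pushes almost everything to one side:

```python
import numpy as np

# Degenerate data along the split dimension: seven ties and one distinct value.
x = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0])

# Exact-median rule: split the index array in half; the halves differ in size
# by at most one, no matter how many ties there are.
mid = len(x) // 2
order = np.argpartition(x, mid)
print(len(order[:mid]), len(order[mid:]))   # 4 4 -> balanced

# If the splitting plane slides off the tied median value instead, every tied
# point lands on one side and only the distinct point on the other.
split_plane = 2.5
print((x <= split_plane).sum(), (x > split_plane).sum())   # 7 1 -> unbalanced
```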
cc @sturlamolden @rainwoodman