BUG: KDTree balanced_tree is unbalanced for degenerate data #14355

Merged
merged 4 commits into from
Jul 20, 2021

Conversation

peterbell10
Member

Reference issue

Fixes gh-14074

What does this implement/fix?

The balanced_tree argument is meant to create a more balanced tree by splitting at the median instead of the midpoint of the known bounds. However, the "sliding" part of the sliding midpoint rule was also being applied after choosing the median which results in an unbalanced split for degenerate data. Instead, keeping it as the exact median results in a balanced tree.
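To illustrate the failure mode described above, here is a minimal 1-D sketch in plain Python. It is a simplified model, not SciPy's C++ build routine: the function names, and the rule that a small or all-equal node becomes a leaf, are assumptions made for the illustration.

```python
def depth_median_then_slide(values, leafsize=1):
    """Tree depth when the median is chosen but the split is then slid."""
    s = sorted(values)
    # Assumption: small or all-equal nodes become leaves.
    if len(s) <= leafsize or s[0] == s[-1]:
        return 1
    split = s[len(s) // 2]
    # Partition with strict "<", as a sliding midpoint rule would.
    n_less = sum(1 for v in s if v < split)
    if n_less == 0:
        # Nothing lies below the median value, so the plane "slides"
        # and only a single minimum point is peeled off.
        n_less = 1
    return 1 + max(depth_median_then_slide(s[:n_less], leafsize),
                   depth_median_then_slide(s[n_less:], leafsize))


def depth_exact_median(values, leafsize=1):
    """Tree depth when the split index stays at the exact median."""
    s = sorted(values)
    if len(s) <= leafsize or s[0] == s[-1]:
        return 1
    mid = len(s) // 2
    return 1 + max(depth_exact_median(s[:mid], leafsize),
                   depth_exact_median(s[mid:], leafsize))


# Degenerate data: many duplicates plus one distinct value.
degenerate = [0.0] * 100 + [1.0]
print(depth_median_then_slide(degenerate))
print(depth_exact_median(degenerate))
```

In this model the first variant degenerates into a chain with about one level per point, while the second stays logarithmic in the number of points.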

Additional information

The comments state that the median rule comes from sklearn, and this change is equivalent to sklearn's KDTree code:
https://github.com/scikit-learn/scikit-learn/blob/3ae7c7615343bbd36acece57825d8b0d70fd9da4/sklearn/neighbors/_binary_tree.pxi#L1176-L1181

cc @sturlamolden @rainwoodman

The point (2.1, 2.9) is equidistant from points (2, 2) and (3, 3), making the
k-neighbours query sensitive to the exact tree structure.
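The equidistance claim can be checked with a few lines of standard-library Python. This is only an arithmetic check of the statement above; the variable names are illustrative.

```python
import math

query = (2.1, 2.9)
# Both squared distances are 0.1**2 + 0.9**2 = 0.82.
d_low = math.dist(query, (2.0, 2.0))
d_high = math.dist(query, (3.0, 3.0))
print(d_low, d_high)
```

Because the two candidates tie (up to floating-point rounding), which one a k-neighbours query returns first can depend on the order in which the tree visits them.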
@peterbell10 peterbell10 requested a review from tylerjereddy as a code owner July 6, 2021 00:24
@sturlamolden
Contributor

sturlamolden commented Jul 6, 2021

I think this PR is logically incorrect. The way our kd-tree works, we cannot have the same value on both sides of the pivot (the splitting plane), which can happen in the case of ties. Our kd-tree cannot be exactly balanced in the presence of ties. This is why we do data swapping after the partial sorting, when in theory all the swapping (partial sorting) should already be complete. It follows that we need to consider sliding the splitting plane even when using the median as the pivot. Otherwise we need to update all query methods to assume we can have equal values on both sides of a splitting plane. IIRC the kd-tree in sklearn does this, whereas ours does not, because it started out using sliding midpoint and added the median later on (mostly because it gave faster queries). I know this is annoying, but I think the build method needs to be the way it is in order to be logically correct given what the query methods assume.

@peterbell10
Member Author

peterbell10 commented Jul 6, 2021

I think this PR is logically incorrect. The way our kd-tree works we cannot have the same value on both sides of the pivot

If this is true, then the sliding midpoint rule is broken as well. Consider constructing a node where all children are equal, but the tree is built with compact=False so it doesn't inspect the bounds and generate a leaf. All points are going to be on the same side of the midpoint and so the split will be chosen as either just after the first element or just before the last element in the list. Since all points are equal, then the same value must exist on both sides of the pivot.
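The argument above can be sketched in plain Python. This is a hypothetical 1-D model of a sliding midpoint split, written only to follow the reasoning in this comment; it is not SciPy's implementation, and `sliding_midpoint_split` and its parameters are made-up names.

```python
def sliding_midpoint_split(values, lo, hi):
    """Split 1-D values at the midpoint of [lo, hi], sliding if one side is empty."""
    split = 0.5 * (lo + hi)
    lesser = [v for v in values if v < split]
    greater = [v for v in values if v >= split]
    if not lesser:
        # Slide the plane down to the smallest point and move that point left.
        split = min(greater)
        greater.remove(split)
        lesser.append(split)
    elif not greater:
        # Slide the plane up to the largest point and move that point right.
        split = max(lesser)
        lesser.remove(split)
        greater.append(split)
    return split, lesser, greater


# All points equal, with bounds wider than the data (no compaction).
split, lesser, greater = sliding_midpoint_split([5.0] * 4, lo=0.0, hi=8.0)
print(split, lesser, greater)
```

With all points equal, the slid plane lands on the shared value and that value ends up on both sides, so `lesser < split` no longer holds.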

@peterbell10
Member Author

If you agree with that, I should be able to fix both issues.

@sturlamolden
Contributor

sturlamolden commented Jul 6, 2021

It was constructed on the promise that `lesser < split <= greater`.
You might be correct that the sliding midpoint is broken as well, though I have never seen it break. Maybe it does not matter for the query functions. Or maybe no one has ever constructed a bad enough data set and reported a failure. I am not sure, but it was certainly never intended to allow `lesser <= split <= greater`.
But you might be right, perhaps we have two bugs to fix here.
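The two invariants discussed here can be written as small predicates. This is a sketch for a single split dimension; the function names are illustrative and do not exist in SciPy.

```python
def satisfies_strict(lesser, split, greater):
    """The intended promise: lesser < split <= greater."""
    return all(v < split for v in lesser) and all(v >= split for v in greater)


def satisfies_relaxed(lesser, split, greater):
    """The unintended, weaker form: lesser <= split <= greater."""
    return all(v <= split for v in lesser) and all(v >= split for v in greater)


# A tie at the split value on the "lesser" side breaks only the strict form.
tied = ([1.0, 2.0], 2.0, [2.0, 3.0])
clean = ([1.0], 2.0, [2.0, 2.0, 3.0])
print(satisfies_strict(*tied), satisfies_relaxed(*tied))
print(satisfies_strict(*clean), satisfies_relaxed(*clean))
```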

@peterbell10
Member Author

You might be correct that the sliding midpoint is broken as well though, but I have never seen it break.

Assuming this is an issue, the chance of it adversely affecting the result of a query is astronomically small. The edge of the query radius would need to land exactly on the point lying on the split line. In that case the result is borderline anyway, and you would expect query results to be unstable just from machine rounding.
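A small sketch of why only the exact-boundary case matters. The pruning rule below is a hypothetical one that trusts `lesser < split`; it is an assumption for illustration and is not taken from SciPy's query code.

```python
def range_query(lesser, split, greater, q, r):
    """1-D ball query over one split, pruning children by the strict invariant."""
    hits = []
    # Assumption: the lesser child only holds values < split,
    # so it is skipped unless the ball reaches strictly below the plane.
    if q - r < split:
        hits += [v for v in lesser if abs(v - q) <= r]
    if q + r >= split:
        hits += [v for v in greater if abs(v - q) <= r]
    return hits


# A split that violates the invariant: the value 5.0 sits on both sides.
lesser, split, greater = [5.0], 5.0, [5.0, 5.0, 5.0]

on_boundary = range_query(lesser, split, greater, q=6.0, r=1.0)
slightly_wider = range_query(lesser, split, greater, q=6.0, r=1.5)
print(len(on_boundary), len(slightly_wider))
```

The point in the lesser child is missed only when the ball edge coincides exactly with the split value; any larger radius finds all four points.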

@peterbell10
Member Author

So, this is more of a pedantically correct thing than a world-breaking bug.

>>> print(dd, ii)
[2. 0.14142136] [ 0 13]
>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=1)
>>> print(dd, ii, sep='\\n')
Contributor

Is it more conventional to use r'\n'?
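The two spellings can be compared in a short sketch, assuming the suggestion refers to making the docstring itself raw (the function names are hypothetical). In a regular docstring the backslash has to be doubled so that the doctest source ends up containing a backslash followed by `n`; a raw docstring lets you write it directly.

```python
def regular_docstring():
    """
    >>> print(1, 2, sep='\\n')
    1
    2
    """


def raw_docstring():
    r"""
    >>> print(1, 2, sep='\n')
    1
    2
    """


# Both docstrings hold the same doctest text.
same_text = regular_docstring.__doc__ == raw_docstring.__doc__
print(same_text)
```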

@@ -1254,16 +1254,15 @@ def test_kdtree_duplicated_inputs(kdtree_type):
# it shall not divide more than 3 nodes.
# root left (1), and right (2)
kdtree = kdtree_type(data, leafsize=1)
Contributor

What about adding a query test here? Is the new balanced tree build covered by a test?

Member Author

@peterbell10 peterbell10 Jul 7, 2021

I've fixed the un-compacted tree to use strict less-than, which means this tree always generates exactly 3 leaves. So, I can just assert that all the indices of the ones are in the first leaf and all the twos are in the second.

@sturlamolden
Contributor

Is this good to merge now? I think it is.

@tylerjereddy tylerjereddy added the defect and backport-candidate labels Jul 17, 2021
@peterbell10
Member Author

Yes, this should be ready to go.

Contributor

@tylerjereddy tylerjereddy left a comment

Looks like the kdtree experts are +1 to merge, CI is green, and I don't see anything too suspicious from a look through myself.

@tylerjereddy tylerjereddy merged commit 81e7c04 into scipy:master Jul 20, 2021
@tylerjereddy
Contributor

thanks @peterbell10 @rainwoodman @sturlamolden

@tylerjereddy tylerjereddy added this to the 1.8.0 milestone Jul 20, 2021
@KelSolaar

Thanks guys!

@tylerjereddy tylerjereddy modified the milestones: 1.8.0, 1.7.1 Jul 23, 2021
@tylerjereddy tylerjereddy removed the backport-candidate label Jul 27, 2021
Labels
defect, scipy.spatial
Development

Successfully merging this pull request may close these issues.

Segmentation fault when building cKDTree with Scipy 1.6.3.
5 participants