BUG: KDTree balanced_tree is unbalanced for degenerate data #14355
Conversation
The point (2.1, 2.9) is equidistant from the points (2, 2) and (3, 3), making the k-neighbours query sensitive to the exact tree structure.
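For concreteness, here is a small check (my own illustration with made-up extra points, not code from the PR) that (2.1, 2.9) is exactly equidistant from (2, 2) and (3, 3), so a k=1 query result depends on which tied neighbour the traversal reaches first:

```python
import numpy as np
from scipy.spatial import KDTree

query = np.array([2.1, 2.9])
print(np.linalg.norm(query - [2, 2]), np.linalg.norm(query - [3, 3]))
# both distances are sqrt(0.82) ~ 0.9055, an exact tie

# With a tie, which index a k=1 query returns depends on the tree layout.
pts = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [5.0, 5.0]])  # made-up data
tree = KDTree(pts)
d, i = tree.query(query, k=1)
print(d, i)  # d ~ 0.9055; i is 0 or 1 depending on how the tree was built
```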
I think this PR is logically incorrect. The way our kd-tree works, we cannot have the same value on both sides of the pivot (the splitting plane), which can happen in the case of ties. Our kd-tree cannot be exactly balanced in the presence of ties. This is why we do data swapping after the partial sorting, when in theory all the swapping (partial sorting) should be completed. And then it follows that we need to consider sliding the splitting plane even when using the median as pivot. Otherwise we need to update all query methods to assume we can have equal values on both sides of a splitting plane. IIRC the kd-tree in sklearn does this, whereas ours does not, because it started out using the sliding midpoint and added the median later on (mostly because it gave faster queries). I know this is annoying, but I think the build method needs to be the way it is in order to be logically correct given what the query methods assume.
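To make the tie problem concrete, here is a minimal NumPy sketch (not SciPy's build code) of the two options: keeping all equal values on one side of the pivot, which cannot be exactly balanced, versus splitting the index array exactly in half, which places the same value on both sides of the splitting plane:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
mid = len(x) // 2
pivot = np.partition(x, mid)[mid]        # median value, 2.0

# Option 1: keep ties together (what the query methods assume).
left, right = x[x < pivot], x[x >= pivot]
print(left, right)                       # [1.] [2. 2. 2. 3.] -- unbalanced

# Option 2: split the index array exactly in half.
order = np.argsort(x)
print(x[order[:mid]], x[order[mid:]])    # [1. 2.] [2. 2. 3.] -- 2.0 on both sides
```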
If this is true, then the sliding midpoint rule is broken as well. Consider constructing a node where all children are equal, but the tree is built with
If you agree with that, I should be able to fix both issues.
It was constructed on the promise that
Assuming this is an issue, the chances of it adversely affecting the result of a query are astronomically small. You would need to have the query radius exactly overlap with the point on the split line, in which case the result is borderline anyway and you would expect query results to be unstable just due to machine rounding.
So, this is more of a pedantically correct thing than a world-breaking bug.
scipy/spatial/kdtree.py
Outdated
>>> print(dd, ii)
[2. 0.14142136] [ 0 13]
>>> dd, ii = tree.query([[0, 0], [2.2, 2.9]], k=1)
>>> print(dd, ii, sep='\\n')
Is it more conventional to use r'\n'?
scipy/spatial/tests/test_kdtree.py
Outdated
@@ -1254,16 +1254,15 @@ def test_kdtree_duplicated_inputs(kdtree_type):
# it shall not divide more than 3 nodes.
# root left (1), and right (2)
kdtree = kdtree_type(data, leafsize=1)
What about adding a query test here? Is the new balanced tree build covered by a test?
I've fixed the un-compacted tree to use strict less-than, which means this tree always generates exactly 3 leaves. So I can just assert that all the indices of the ones are in the first leaf and all the twos in the second.
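For reference, a sketch of the kind of check described above, using cKDTree's node introspection; the duplicated-data layout and the expected three-node structure come from the test being discussed, while the attribute walk and the size n are my own illustration:

```python
import numpy as np
from scipy.spatial import cKDTree

# n copies of (1, 1) followed by n copies of (2, 2), as in the duplicated-inputs test.
n = 1024
data = np.concatenate([np.ones((n, 2)), np.full((n, 2), 2.0)])

# Un-compacted balanced build; with strict less-than the tree should be just a
# root with two leaves, since the duplicates cannot be split any further.
kdtree = cKDTree(data, leafsize=1, compact_nodes=False, balanced_tree=True)

root = kdtree.tree                      # cKDTreeNode view of the root
left, right = root.lesser, root.greater
print(left.children, right.children)    # expected: n and n
print(np.unique(data[left.indices], axis=0))   # expected: only [1., 1.]
print(np.unique(data[right.indices], axis=0))  # expected: only [2., 2.]
```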
13aa953 to 5554d4c
Is this good to merge now? I think it is.
Yes, this should be ready to go.
Looks like the kdtree experts are +1 to merge, CI is green, and I don't see anything too suspicious from a look through myself.
Thanks guys!
Reference issue
Fixes gh-14074
What does this implement/fix?
The balanced_tree argument is meant to create a more balanced tree by splitting at the median instead of the midpoint of the known bounds. However, the "sliding" part of the sliding midpoint rule was also being applied after choosing the median, which results in an unbalanced split for degenerate data. Instead, keeping the split at the exact median results in a balanced tree.
Additional information
The comments state that the median rule comes from sklearn, and this change is equivalent to the sklearn KDTree code:
https://github.com/scikit-learn/scikit-learn/blob/3ae7c7615343bbd36acece57825d8b0d70fd9da4/sklearn/neighbors/_binary_tree.pxi#L1176-L1181
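As a rough schematic of the difference (my own illustration, not the SciPy or scikit-learn source): splitting on the middle index keeps the two halves balanced even for degenerate data, whereas a plane that slides off the tied median value pushes almost everything to one side:

```python
import numpy as np

# Degenerate data along the split dimension: seven ties and one distinct value.
x = np.array([2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0])

# Exact-median rule: split the index array in half; the halves differ in size
# by at most one, no matter how many ties there are.
mid = len(x) // 2
order = np.argpartition(x, mid)
print(len(order[:mid]), len(order[mid:]))   # 4 4 -> balanced

# If the splitting plane slides off the tied median value instead, every tied
# point lands on one side and only the distinct point on the other.
split_plane = 2.5
print((x <= split_plane).sum(), (x > split_plane).sum())   # 7 1 -> unbalanced
```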
cc @sturlamolden @rainwoodman