ref(seer grouping): Prepare ingest for multiple Seer results #88619

Merged
merged 3 commits into from
Apr 3, 2025

@lobsterkatie lobsterkatie commented Apr 2, 2025

This refactors get_seer_similar_issues, which is used during ingestion, to allow it to handle multiple Seer matches. (Note that it does not actually change the number of results requested - which is still only 1 - but makes it so that when we do increase that number, we'll be able to handle what comes back.)

Key changes:

  • Pull logic for checking whether a match can be used into a helper, _should_use_seer_match_for_grouping, to be called on each result.

  • Run the results returned from Seer through that function regardless of the hybrid fingerprint status of the incoming event, because the closest Seer match(es) might be hybrid and therefore the same check is needed.

  • Change the grouping.similarity.hybrid_fingerprint_seer_result metric to a grouping.similarity.hybrid_fingerprint_match_check metric, since there will now be multiple instances for a single incoming event, possibly with different values. A new metric in the spirit of the original (one encompassing the entire process for a given event) will be added back in a follow-up PR.

We can see that these changes don't affect the eventual outcome of the process because the only test changes this refactor requires are ones having to do with the metric. (To keep things simple, tests covering the handling of multiple results will be added in a follow-up PR. The point of this PR is simply to do the refactor and show that the new code is equivalent to the old code.)
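
For readers without the diff open, the shape of the refactor described above can be sketched roughly as follows. This is an illustration only, not the code in `src/sentry/grouping/ingest/seer.py`: the `SeerMatch` type, the `is_hybrid` rule (a fingerprint containing `{{ default }}` plus at least one custom value), the `record_metric` callback, and the `result` tag values are all assumptions made for the sketch. Only the helper's purpose and the metric name come from the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Fingerprint = Tuple[str, ...]


@dataclass(frozen=True)
class SeerMatch:
    """Hypothetical stand-in for a single Seer result."""

    parent_hash: str
    fingerprint: Fingerprint  # fingerprint of the matched (parent) group


def is_hybrid(fingerprint: Fingerprint) -> bool:
    # Assumption: "hybrid" means `{{ default }}` combined with custom values.
    return "{{ default }}" in fingerprint and len(fingerprint) > 1


def should_use_match(
    event_fingerprint: Fingerprint,
    match: SeerMatch,
    record_metric: Callable[[str, str], None],
) -> bool:
    """Per-result check, run no matter whether the incoming event is hybrid."""
    event_hybrid = is_hybrid(event_fingerprint)
    match_hybrid = is_hybrid(match.fingerprint)

    if not event_hybrid and not match_hybrid:
        return True  # nothing hybrid involved, so no extra check is needed

    if event_hybrid and match_hybrid:
        usable = event_fingerprint == match.fingerprint
        result = "fingerprint_match" if usable else "fingerprint_mismatch"
    else:
        usable = False
        result = "only_event_hybrid" if event_hybrid else "only_parent_hybrid"

    # One metric instance per checked result; tag values here are made up.
    record_metric("grouping.similarity.hybrid_fingerprint_match_check", result)
    return usable


def pick_match(
    event_fingerprint: Fingerprint,
    matches: Sequence[SeerMatch],
    record_metric: Callable[[str, str], None],
) -> Optional[SeerMatch]:
    """Return the first usable match, or None if every result is rejected."""
    for match in matches:
        if should_use_match(event_fingerprint, match, record_metric):
            return match
    return None
```

With a single requested result the loop runs at most once, which is why behavior is unchanged until the number of requested results is raised.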

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 2, 2025
@lobsterkatie lobsterkatie force-pushed the kmclb-prework-for-handling-multiple-seer-results branch from 27e817c to 8214c7c Compare April 2, 2025 20:33

codecov bot commented Apr 2, 2025

Codecov Report

Attention: Patch coverage is 97.82609% with 1 line in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/sentry/grouping/ingest/seer.py | 97.22% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #88619      +/-   ##
==========================================
- Coverage   87.73%   87.73%   -0.01%     
==========================================
  Files       10043    10043              
  Lines      568198   568215      +17     
  Branches    22303    22303              
==========================================
+ Hits       498535   498545      +10     
- Misses      69259    69266       +7     
  Partials      404      404              

@lobsterkatie lobsterkatie force-pushed the kmclb-handle-multiple-seer-results-in-ingest branch from 2a504d6 to 4656c81 Compare April 2, 2025 20:33
@lobsterkatie lobsterkatie force-pushed the kmclb-prework-for-handling-multiple-seer-results branch from 8214c7c to 891aeff Compare April 2, 2025 20:41
@lobsterkatie lobsterkatie force-pushed the kmclb-handle-multiple-seer-results-in-ingest branch from 4656c81 to 459a950 Compare April 2, 2025 20:41
@lobsterkatie lobsterkatie force-pushed the kmclb-prework-for-handling-multiple-seer-results branch from 891aeff to 1a0edee Compare April 3, 2025 04:01
@lobsterkatie lobsterkatie force-pushed the kmclb-handle-multiple-seer-results-in-ingest branch from 459a950 to 8636e58 Compare April 3, 2025 04:01
Base automatically changed from kmclb-prework-for-handling-multiple-seer-results to master April 3, 2025 16:05
@lobsterkatie lobsterkatie force-pushed the kmclb-handle-multiple-seer-results-in-ingest branch from 8636e58 to ceff3c0 Compare April 3, 2025 16:10
@lobsterkatie lobsterkatie marked this pull request as ready for review April 3, 2025 16:42
@lobsterkatie lobsterkatie requested a review from a team as a code owner April 3, 2025 16:42
@lobsterkatie lobsterkatie merged commit f13e034 into master Apr 3, 2025
48 checks passed
@lobsterkatie lobsterkatie deleted the kmclb-handle-multiple-seer-results-in-ingest branch April 3, 2025 16:45
lobsterkatie added a commit that referenced this pull request Apr 3, 2025
This adds new metrics to `get_seer_similar_issues`, so we'll be able to see the effects of requesting multiple results from Seer during ingest once we start doing that. Metrics added:

- `grouping.similarity.seer_results_returned`: Just because we ask Seer for the 100 closest matches (say), it doesn't mean Seer's necessarily going to find 100 which exceed the `should_group` threshold. It's therefore useful to know how many matches Seer is actually finding, so we can get a sense for when increasing the number requested stops making a difference.

- `grouping.similarity.hybrid_fingerprint_results_checked`: This similarly will help us evaluate the number of results we're requesting. If we almost always find a match within the first 50 results, for example, it doesn't make sense to request many more than that, even if they exist. Both this and the metric above are tagged with platform, so we can determine if it would make sense to vary the number of matches requested based on platform.

- `grouping.similarity.get_seer_similar_issues`: This replaces the `grouping.similarity.hybrid_fingerprint_seer_result` metric which was removed in #88619, and tracks the overall result of the `get_seer_similar_issues` call. It's different from the old metric in two ways, though: 1) In the case in which the Seer match(es) is/are rejected, it's not as specific as the old one about the reason, since if Seer returns multiple results, it might be a combo of reasons. 2) It also includes non-hybrid cases. (It includes an `is_hybrid` tag to differentiate one from the other.)

This PR also adds the above data to our logs.
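
As a rough illustration of how the three metrics relate to one another, here is a minimal sketch. It assumes a toy in-memory recorder rather than Sentry's metrics backend; the recorder class, the choice of distribution versus counter, the `result` tag values, and the decision to skip the results-checked metric when Seer returns nothing are all assumptions. Only the metric names and the `platform` and `is_hybrid` tags come from the commit message.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Optional, Sequence


class InMemoryMetrics:
    """Toy recorder used only for this sketch (not a real metrics client)."""

    def __init__(self) -> None:
        self.distributions: Dict[tuple, list] = defaultdict(list)
        self.counters: Dict[tuple, int] = defaultdict(int)

    def distribution(self, key: str, value: int, tags: Dict[str, Any]) -> None:
        self.distributions[(key, tuple(sorted(tags.items())))].append(value)

    def incr(self, key: str, tags: Dict[str, Any]) -> None:
        self.counters[(key, tuple(sorted(tags.items())))] += 1


def record_seer_metrics(
    metrics: InMemoryMetrics,
    platform: str,
    event_is_hybrid: bool,
    results: Sequence[str],
    is_usable: Callable[[str], bool],
) -> Optional[str]:
    # How many matches Seer actually found, regardless of how many were requested.
    metrics.distribution(
        "grouping.similarity.seer_results_returned", len(results), {"platform": platform}
    )

    # Walk the results in order and stop at the first usable one.
    checked = 0
    chosen = None
    for result in results:
        checked += 1
        if is_usable(result):
            chosen = result
            break

    if results:
        metrics.distribution(
            "grouping.similarity.hybrid_fingerprint_results_checked",
            checked,
            {"platform": platform},
        )

    # One overall outcome per event; these tag values are hypothetical.
    if not results:
        outcome = "no_seer_matches"
    elif chosen is None:
        outcome = "all_matches_rejected"
    else:
        outcome = "match_found"
    metrics.incr(
        "grouping.similarity.get_seer_similar_issues",
        {"is_hybrid": event_is_hybrid, "result": outcome},
    )
    return chosen
```

Comparing the two distributions per platform is what would show whether requesting more results still changes the outcome.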
andrewshie-sentry pushed a commit that referenced this pull request Apr 8, 2025
This refactors `get_seer_similar_issues`, which is used during ingestion, to allow it to handle multiple Seer matches. (Note that it does not actually change the number of results requested - which is still only 1 - but makes it so that when we do increase that number, we'll be able to handle what comes back.)

Key changes:

- Pull logic for checking whether a match can be used into a helper, `_should_use_seer_match_for_grouping`, to be called on each result.

- Run the results returned from Seer through that function regardless of the hybrid fingerprint status of the incoming event, because the closest Seer match(es) might be hybrid and therefore the same check is needed.

- Change the `grouping.similarity.hybrid_fingerprint_seer_result` metric to a `grouping.similarity.hybrid_fingerprint_match_check` metric, since there will now be multiple instances for a single incoming event, possibly with different values. A new metric in the spirit of the original (one encompassing the entire process for a given event) will be added back in a follow-up PR.

We can see that these changes don't affect the eventual outcome of the process because the only test changes this refactor requires are ones having to do with the metric. (To keep things simple, tests covering the handling of multiple results will be added in a follow-up PR[1]. The point of this PR is simply to do the refactor and show that the new code is equivalent to the old code.)


[1] #88621
andrewshie-sentry pushed a commit that referenced this pull request Apr 8, 2025