feat(alerts): Add issue summary to slack issue alerts #88033
Conversation
lgtm, defer to alerts team to review
src/sentry/features/temporary.py
# Enables automatically triggering issue summary on alerts
manager.add("projects:trigger-issue-summary-on-alerts", ProjectFeature, FeatureHandlerStrategy.FLAGPOLE, api_expose=True)
Using a project-level flag so I can just try it out on Seer before releasing to the internal org.
do we need this to be exposed in the API to the frontend?
nope, set it to false
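For reference, a sketch of what the registration above would look like after that change (same flag as in the diff, just with api_expose flipped):

# Enables automatically triggering issue summary on alerts
manager.add("projects:trigger-issue-summary-on-alerts", ProjectFeature, FeatureHandlerStrategy.FLAGPOLE, api_expose=False)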
can we also please get a screenshot of how the notification looks?
https://develop.sentry.dev/integrations/slack/ - docs on how to set it up
try:
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(get_issue_summary, self.group)
        summary_result, status_code = future.result(timeout=5)
can we refactor the timeout magic number into a constant and perhaps add metrics/span here so we can monitor how long this is taking?
Could we instead define a new sentry option that allows us to modify this timeout on the fly if we need to adjust it? It'll be faster than waiting for a full rollout if something goes wrong.
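A rough sketch of that suggestion, assuming the usual register/options.get helpers and a hypothetical option name (alerts.issue-summary-timeout):

# src/sentry/options/defaults.py (sketch) -- option name is hypothetical
register(
    "alerts.issue-summary-timeout",
    default=5,
    flags=FLAG_AUTOMATOR_MODIFIABLE,
)

# at the call site, read the option instead of hard-coding 5
from sentry import options

timeout = options.get("alerts.issue-summary-timeout")
summary_result, status_code = future.result(timeout=timeout)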
        return summary_result
    return None
except (concurrent.futures.TimeoutError, Exception) as e:
    logger.exception("Error generating issue summary: %s", e)
Do we want metrics to track these failures?
Is a span and the Sentry issue generated from this sufficient? I'm not too knowledgeable about other ways to track metrics here.
That should be fine for now. If we want more granular tracking of things like duration over time or success/failure counts, we tend to wrap calls like these in metrics helpers like the one below, so we can graph them in Datadog, assign them to SLOs, etc.:
sentry/src/sentry/tasks/groupowner.py
Line 57 in c284a68
with metrics.timer("sentry.tasks.process_suspect_commits.process_loop"):
This is totally optional though.
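For illustration, a sketch of how the summary fetch could be wrapped (the metric names here are made up, and ISSUE_SUMMARY_TIMEOUT is the hypothetical constant mentioned earlier, not code from the PR):

import concurrent.futures
from sentry.utils import metrics

ISSUE_SUMMARY_TIMEOUT = 5  # hypothetical constant replacing the magic number

# time the whole fetch so duration can be graphed / attached to an SLO
with metrics.timer("alerts.issue_summary.fetch"):
    try:
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(get_issue_summary, self.group)
            summary_result, status_code = future.result(timeout=ISSUE_SUMMARY_TIMEOUT)
    except concurrent.futures.TimeoutError:
        # count timeouts separately so they can be alerted on
        metrics.incr("alerts.issue_summary.timeout")
        raise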
    return None

try:
    with concurrent.futures.ThreadPoolExecutor() as executor:
is this pattern something Seer uses?
I'm not sure what you mean. We use threading a lot in Seer, but it doesn't seem relevant here. I just thought this would be a simple way to set a timeout, since we said we wanted one.
What's the benefit of starting an async thread here and synchronously waiting for it vs. passing the timeout as a parameter to the get_issue_summary util function instead? Seems like we're just using a request under the hood, which can already handle this concern:
sentry/src/sentry/seer/issue_summary.py
Line 99 in a05d27f
response = requests.post(
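For context, the alternative being suggested is to thread a timeout down to that request, since requests.post already accepts one (a sketch; seer_url, payload, and timeout_seconds are placeholders, not the actual code in issue_summary.py):

# inside get_issue_summary (sketch): bound just the HTTP call to Seer
response = requests.post(
    seer_url,
    json=payload,
    timeout=timeout_seconds,  # raises requests.exceptions.Timeout if Seer is slow
)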
My concern with the timeout is not just the Seer call, but also the queries we're running to get trace-connected issues on the Sentry backend. In my experience those are often the main culprit for slow summary generation. Is there a better way to wrap both of those in a timeout?
Ahh that's fair. I'm not super familiar with eventstore, but this approach makes sense if we're trying to cover those under the same 5s timeout.
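For reference, a rough sketch of the pattern being discussed, with the executor shut down without waiting so the caller isn't blocked past the budget (the constant and helper names are illustrative, not taken from the PR):

import concurrent.futures

SUMMARY_TIMEOUT_SECONDS = 5  # illustrative name for the 5s budget

def get_summary_with_timeout(group):
    # Run get_issue_summary (the eventstore queries for trace-connected
    # issues plus the Seer HTTP call) in a worker thread so both share a
    # single wall-clock budget.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(get_issue_summary, group)
        summary_result, status_code = future.result(timeout=SUMMARY_TIMEOUT_SECONDS)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # wait=False: don't block on a slow worker; it finishes in the background
        executor.shutdown(wait=False)
    if status_code == 200:
        return summary_result
    return None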
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.
Additional details and impacted files

@@            Coverage Diff             @@
##           master   #88033      +/-   ##
==========================================
- Coverage   87.71%   87.71%   -0.01%
==========================================
  Files        9977     9978       +1
  Lines      564630   564726      +96
  Branches    22232    22232
==========================================
+ Hits       495292   495375      +83
- Misses      68922    68935      +13
  Partials      416      416
This PR should:
- for any issue alert in Slack, only for error issues, only if behind the gen-ai-features FF and behind the new project-level summary x alerts FF...
- fetch the summary for the issue (will hit the cache if it already exists, or generate a new one otherwise)
- time out at 5 seconds
- replace the title and body of the alert with the summary content, only if we got a summary successfully

---------

Co-authored-by: getsantry[bot] <66042841+getsantry[bot]@users.noreply.github.com>
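As a rough sketch of how those gating conditions compose (the helper name is illustrative, and the exact gen-ai flag string is assumed rather than copied from the diff):

from sentry import features
from sentry.issues.grouptype import GroupCategory

def should_use_issue_summary(group) -> bool:
    # Error issues only, and only when both feature flags are enabled.
    return (
        group.issue_category == GroupCategory.ERROR
        and features.has("organizations:gen-ai-features", group.organization)
        and features.has("projects:trigger-issue-summary-on-alerts", group.project)
    )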
Suspect Issues
This pull request was deployed and Sentry observed the following issues: