Add gpu metrics to TraceRecord #5930

Closed
wants to merge 5 commits into from

FloWuenne

Description

This PR implements the feature requests from #4286.

The current implementation adds the following fields to the TraceRecord:

  • gpu_model
  • gpu_mem
  • gpu_driver
  • %gpu
  • %gpu_mem
  • avg_gpu_mem

This is accomplished by implementing a new nxf_gpu_watch() function in command-trace.txt, similar to nxf_mem_watch(). GPU metrics then become available in the trace file once the following lines are added to nextflow.config:

trace {
    fields = 'gpu_model,gpu_mem,gpu_driver,%gpu,%gpu_mem,avg_gpu_mem'
}
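
To make the idea concrete, here is a minimal Python sketch of how periodic nvidia-smi samples could be reduced to values resembling the fields listed above. It is not the PR's bash implementation: the helper names (sample_gpu, summarize), the mapping from nvidia-smi query fields to trace fields, and the choice of peak versus average for each field are all assumptions made for illustration. It also assumes a single GPU whose query values are all numeric.

```python
import csv
import io
import subprocess

# Query fields accepted by `nvidia-smi --query-gpu`. Mapping them onto the
# trace fields from this PR is an assumption of this sketch.
QUERY = "name,memory.total,driver_version,utilization.gpu,utilization.memory,memory.used"


def sample_gpu():
    """Return one CSV line per GPU from nvidia-smi (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip().splitlines()


def summarize(samples):
    """Aggregate CSV sample lines for a single GPU into trace-style values."""
    reader = csv.reader(io.StringIO("\n".join(samples)))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return {}
    util = [float(r[3]) for r in rows]      # utilization.gpu, percent
    mem_util = [float(r[4]) for r in rows]  # utilization.memory, percent
    mem_used = [float(r[5]) for r in rows]  # memory.used, MiB
    return {
        "gpu_model": rows[0][0],
        "gpu_mem": f"{rows[0][1]} MiB",
        "gpu_driver": rows[0][2],
        "%gpu": max(util),          # peak utilization (assumed semantics)
        "%gpu_mem": max(mem_util),  # peak memory utilization (assumed semantics)
        "avg_gpu_mem": sum(mem_used) / len(mem_used),
    }
```

A watcher would call sample_gpu() at a fixed interval while the task runs and pass the collected lines to summarize() when it exits. Because the samples come from the whole device, the result reflects everything running on that GPU, not only the task.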

Limitations

A current limitation of this approach is that when Nextflow runs with an executor that shares a system-wide GPU (for example the local executor on a workstation), the usage metrics are not task-specific: they include any background processes using the GPU. For example, on my local Ubuntu machine with an NVIDIA GeForce GTX 1650, even a CPU-only task reported some GPU usage because of background GPU processes. This should not happen on instances where GPUs are assigned to specific tasks using accelerator; however, I haven't tested this on multi-GPU machines on AWS or with other cloud executors.
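
One possible way to narrow device-wide numbers down to a task, sketched below in Python, is to read per-process GPU memory from nvidia-smi and keep only the PIDs that belong to the task. This is not part of the PR: the helper names are hypothetical, it covers memory only (not utilization), and it assumes the task's PIDs are visible in the same PID namespace as nvidia-smi, which may not hold inside containers.

```python
import subprocess


def query_compute_apps():
    """Return `pid, used_memory` CSV lines for processes using the GPU for compute."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout


def parse_compute_apps(text):
    """Parse `pid, used_memory` lines into {pid: MiB}, summing repeated PIDs."""
    usage = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank lines and non-numeric values such as "[N/A]"
        pid, mem = int(parts[0]), int(parts[1])
        usage[pid] = usage.get(pid, 0) + mem
    return usage


def task_gpu_mem(task_pids, text):
    """Sum the GPU memory (MiB) used by the given PIDs only."""
    usage = parse_compute_apps(text)
    return sum(mem for pid, mem in usage.items() if pid in task_pids)
```

For example, task_gpu_mem({1234}, query_compute_apps()) would ignore memory held by a desktop session or any other process, provided 1234 is the task's PID as the driver sees it.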

@FloWuenne FloWuenne requested a review from a team as a code owner March 31, 2025 19:06
@FloWuenne FloWuenne changed the title Gpu metrics record Add gpu metrics to TraceRecord Mar 31, 2025

netlify bot commented Mar 31, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 530d9a4
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67eae7ba6c86b00008b2df36
😎 Deploy Preview https://deploy-preview-5930--nextflow-docs-staging.netlify.app

@pditommaso
Member

This looks like a nice step forward. I'm starting to wonder whether it's time we wrote our own tool to trace these metrics (including GPU). @jordeu @fntlnz thoughts?

@jordeu
Collaborator

jordeu commented Apr 3, 2025

> This looks like a nice step forward. I'm starting to wonder whether it's time we wrote our own tool to trace these metrics (including GPU). @jordeu @fntlnz thoughts?

Absolutely. Our own tool will simplify Nextflow bash scripts and allow us to do many more things, like collecting metrics during the process's lifetime.

@FloWuenne
Author

Since our own tool for performance metrics still seems some way off, do we want to move forward with nvidia-smi here in the meantime and add more GPU controllers over time, or are there major parts of this implementation we want to change?

@fntlnz
Collaborator

fntlnz commented Apr 8, 2025

> This looks like a nice step forward. I'm starting to wonder whether it's time we wrote our own tool to trace these metrics (including GPU). @jordeu @fntlnz thoughts?

Agreed, seeing the broader picture with the ability to drill down to the specific task would be very useful both to optimize pipelines and also to optimize their runtime scheduling.

@pditommaso
Member

pditommaso commented Apr 10, 2025

@FloWuenne this is a nice POC, but I fear the shell script is becoming too big and complex. I would also like to avoid introducing another background process to monitor GPU stats. I talked with @jordeu and we are going to implement a new tracing tool, compiled as a binary, taking inspiration from this PR.

@pditommaso pditommaso closed this Apr 10, 2025
@FloWuenne
Author

Sounds good @pditommaso! Looking forward to seeing the new tracing tool soon!

4 participants