Add gpu metrics to TraceRecord #5930
Conversation
Absolutely. Our own tool will simplify Nextflow bash scripts and allow us to do many more things, like collecting metrics during the process's lifetime.
Since our own tool for performance metrics still seems a bit away, do we want to move forward in the meantime with using nvidia-smi here and add more GPU controllers over time, or are there major parts of this implementation we want to change?
Agreed, seeing the broader picture with the ability to drill down to a specific task would be very useful, both to optimize pipelines and to optimize their runtime scheduling.
@FloWuenne this is a nice POC, but I fear the shell script is becoming too big/complex. I would also like to avoid introducing another background process to monitor GPU stats. I talked with @jordeu and we are going to implement a new compiled tracing tool binary, taking inspiration from this PR.
Sounds good @pditommaso! Looking forward to seeing the new tracing tool soon!
Description
This PR implements feature requests from #4286.
The current implementation adds new GPU metric fields to the TraceRecord.
This is accomplished by implementing a new `nxf_gpu_watch()` function in `command-trace.txt`, similar to `nxf_mem_watch()`. The GPU metrics can then be accessed in the trace file by adding the corresponding lines to `nextflow.config`, along the lines of the sketch below.
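The exact configuration from the PR description is not reproduced here; the following is a minimal sketch of a typical trace configuration, assuming hypothetical GPU field names (`gpu_perc`, `gpu_mem`) that may differ from the ones actually introduced by this PR:

```groovy
// nextflow.config -- illustrative only; the GPU field names below
// (gpu_perc, gpu_mem) are placeholders, not necessarily the names added by this PR.
trace {
    enabled = true
    file    = 'pipeline_trace.txt'
    fields  = 'task_id,name,status,%cpu,%mem,gpu_perc,gpu_mem'
}
```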
Limitations
A current limitation of this approach is that, when using Nextflow with an executor such as local and a GPU that is shared system-wide, the usage metrics are not task-specific but also include background processes. For example, on my local Ubuntu machine with an NVIDIA GeForce GTX 1650, even a CPU-only task reported some GPU usage, due to background processes utilizing the GPU. This should not be the case on instances where GPUs are assigned to specific tasks using the accelerator directive; however, I haven't tested this on multi-GPU machines on AWS or other cloud executors.
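For reference, a sketch of how a process can request a dedicated GPU via Nextflow's `accelerator` directive; the process name and GPU type below are illustrative and not part of this PR:

```groovy
// Illustrative process (not part of this PR): requesting a dedicated GPU so that,
// on executors that enforce GPU isolation, reported GPU metrics reflect this task only.
process gpu_task {
    accelerator 1, type: 'nvidia-tesla-t4'

    script:
    """
    nvidia-smi
    """
}
```

Note that on the local executor a directive like this does not by itself isolate the GPU, so system-wide processes may still show up in the reported metrics.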