Add gpu metrics to TraceRecord #5930
Conversation
Absolutely. Our own tool will simplify Nextflow bash scripts and allow us to do many more things, like collecting metrics during the process's lifetime.
Since our own tool for performance metrics still seems a bit away, do we want to move forward in the meantime with using nvidia-smi here and add more GPU controllers over time, or are there major parts of this implementation we want to change?
Agreed, seeing the broader picture with the ability to drill down to a specific task would be very useful, both to optimize pipelines and to optimize their runtime scheduling.
@FloWuenne this is a nice POC, but I fear the shell script is becoming too big/complex. I would also like to avoid introducing another background process to monitor GPU stats. I talked with @jordeu and we are going to implement a new compiled tracing tool binary, taking inspiration from this PR.
Sounds good @pditommaso! Looking forward to seeing the new tracing tool soon!
Description
This PR implements feature requests from #4286.
The current implementation adds new GPU metric fields to the TraceRecord.
This is accomplished by implementing a new `nxf_gpu_watch()` function in `command-trace.txt`, similar to `nxf_mem_watch()`. The GPU metrics can then be accessed in the trace file by adding the corresponding lines to `nextflow.config`, along the lines of the sketch below.
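The exact configuration from the PR description is not reproduced here; the following is a minimal sketch of a typical trace configuration, assuming hypothetical GPU field names (`gpu_perc`, `gpu_mem`) that may differ from the ones actually introduced by this PR:

```groovy
// nextflow.config -- illustrative only; the GPU field names below
// (gpu_perc, gpu_mem) are placeholders, not necessarily the names added by this PR.
trace {
    enabled = true
    file    = 'pipeline_trace.txt'
    fields  = 'task_id,name,status,%cpu,%mem,gpu_perc,gpu_mem'
}
```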
Limitations
A current limitation of this approach is that, when using Nextflow with an executor such as local and a GPU that is shared system-wide, the usage metrics are not task-specific but also include background processes. For example, on my local Ubuntu machine with an NVIDIA GeForce GTX 1650, even a CPU-only task reported some GPU usage, due to background processes utilizing the GPU. This should not be the case on instances where GPUs are assigned to specific tasks using the accelerator directive; however, I haven't tested this on multi-GPU machines on AWS or other cloud executors.
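For reference, a sketch of how a process can request a dedicated GPU via Nextflow's `accelerator` directive; the process name and GPU type below are illustrative and not part of this PR:

```groovy
// Illustrative process (not part of this PR): requesting a dedicated GPU so that,
// on executors that enforce GPU isolation, reported GPU metrics reflect this task only.
process gpu_task {
    accelerator 1, type: 'nvidia-tesla-t4'

    script:
    """
    nvidia-smi
    """
}
```

Note that on the local executor a directive like this does not by itself isolate the GPU, so system-wide processes may still show up in the reported metrics.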