
GPU usage metrics #4286

@shenker

New feature

It would be extremely useful if GPU usage metrics were recorded for GPU tasks.

Usage scenario

Using GPU resources efficiently on HPC is often a challenge. For example, basecalling Oxford Nanopore sequencing data using the dorado basecaller often takes quite a bit of tuning to get good performance on HPC, for the following reasons:

  1. Duplex mode makes heavy use of random access over thousands of files, resulting in low GPU utilization if the shared filesystem cannot keep up. Being able to monitor GPU utilization would allow detecting and mitigating this issue.
  2. Dorado's performance varies widely across GPU hardware, and HPC clusters are often equipped with heterogeneous GPU models. When parallelizing dorado jobs, it would be useful to measure the relative performance gap between the different GPU models; that information could then be used to fine-tune job GPU requirements. (In SLURM and other cluster managers it is usually possible to specify which GPU hardware a job is willing to use, e.g. to exclude very old Nvidia architectures that no longer offer acceptable performance for a particular task.)
  3. Dorado is a heavy user of GPU VRAM and crashes when it runs out. Monitoring VRAM usage would help users tune dorado parameters to balance performance against VRAM usage, and to know which GPU hardware to request from the cluster manager.

These are very common issues when running dorado on HPC (there are tons of issues on dorado's bug tracker, see e.g. nanoporetech/dorado#68, nanoporetech/dorado#336, nanoporetech/dorado#306). This is just one particular example; I imagine the same basic GPU metrics would be useful to most users running GPU tasks with Nextflow.

Suggested implementation

An initial implementation could restrict itself to Nvidia GPUs, since those are overwhelmingly the most important for scientific computing.

Use nvidia-smi to record the following metrics (see the sketch below the list):

  • GPU name(s) (useful in a heterogeneous cluster to see what GPU hardware a particular task is running on). The same architecture with different amounts of VRAM should be considered different models (e.g., “A100 40GB” should be distinguished from “A100 80GB”)
  • GPU utilization (average)
  • GPU memory (average used, total available)
  • GPU memory utilization (% of time memory controller was busy)

(A quick Google search turned up this list of ways to programmatically grab GPU metrics: https://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda)

It would be especially useful if the HTML report offered a way to view all metrics broken down by GPU hardware: perhaps a set of checkboxes of GPU model names, so that selecting or deselecting GPU models updates the GPU utilization/VRAM plots accordingly.
