
Training does not work with 1 GPU #2

@kata44

Description


There seems to be a problem with the contrastive loss when training with 1 GPU; training only works when setting no_insgen=true.

The output is:

Setting up augmentation...
Distributing across 1 GPUs...
Distributing Contrastive Heads across 1 GPUS...
Setting up training phases...
Setting up contrastive training phases...
Exporting sample images...
Initializing logs...
2021-09-18 04:23:26.767334: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 583, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 576, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 421, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/katarina/ML/insgen/training/training_loop.py", line 326, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain, cl_phases=cl_phases, D_ema=D_ema, g_fake_cl=not no_cl_on_g, **cl_loss_weight)
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 156, in accumulate_gradients
    loss_Dreal = loss_Dreal + lw_real_cl * self.run_cl(real_img_tmp, real_c, sync, Dphase.module, D_ema, loss_name='D_cl')
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 71, in run_cl
    loss = contrastive_head(logits0, logits1, loss_only=loss_only, update_q=update_q)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 183, in forward
    self._dequeue_and_enqueue(k)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 51, in _dequeue_and_enqueue
    keys = concat_all_gather(keys)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 197, in concat_all_gather
    for _ in range(torch.distributed.get_world_size())]
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
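The failure comes from concat_all_gather in training/contrastive_head.py, which unconditionally calls torch.distributed.get_world_size() even though a single-GPU run never calls torch.distributed.init_process_group. A minimal sketch of a possible workaround (not the repository's official fix) is to fall back to the local tensor whenever no default process group exists; the helper name safe_concat_all_gather below is hypothetical:

import torch
import torch.distributed as dist

@torch.no_grad()
def safe_concat_all_gather(tensor):
    # Hypothetical drop-in replacement for concat_all_gather in
    # training/contrastive_head.py. If torch.distributed was never
    # initialized (e.g. a 1-GPU run), skip the gather and return the
    # local tensor, so the MoCo-style queue update still works.
    if not dist.is_available() or not dist.is_initialized():
        return tensor
    tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)

With a guard like this in place of the existing gather call, _dequeue_and_enqueue should no longer hit get_world_size() on an uninitialized process group when world size is 1; multi-GPU behavior is unchanged.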
