There seems to be a problem with the contrastive loss when training on a single GPU; training only works when setting `no_insgen=true`.
The output is:
```
Setting up augmentation...
Distributing across 1 GPUs...
Distributing Contrastive Heads across 1 GPUS...
Setting up training phases...
Setting up contrastive training phases...
Exporting sample images...
Initializing logs...
2021-09-18 04:23:26.767334: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Training for 25000 kimg...
Traceback (most recent call last):
  File "train.py", line 583, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 576, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 421, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/katarina/ML/insgen/training/training_loop.py", line 326, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain, cl_phases=cl_phases, D_ema=D_ema, g_fake_cl=not no_cl_on_g, **cl_loss_weight)
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 156, in accumulate_gradients
    loss_Dreal = loss_Dreal + lw_real_cl * self.run_cl(real_img_tmp, real_c, sync, Dphase.module, D_ema, loss_name='D_cl')
  File "/home/katarina/ML/insgen/training/contrastive_loss.py", line 71, in run_cl
    loss = contrastive_head(logits0, logits1, loss_only=loss_only, update_q=update_q)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 183, in forward
    self._dequeue_and_enqueue(k)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 51, in _dequeue_and_enqueue
    keys = concat_all_gather(keys)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/katarina/ML/insgen/training/contrastive_head.py", line 197, in concat_all_gather
    for _ in range(torch.distributed.get_world_size())]
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 748, in get_world_size
    return _get_group_size(group)
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 274, in _get_group_size
    default_pg = _get_default_group()
  File "/home/katarina/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 358, in _get_default_group
    raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
```
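
The failure is in `concat_all_gather` in `training/contrastive_head.py`, which calls `torch.distributed.get_world_size()` even when no default process group exists (the single-GPU code path never calls `init_process_group`). A minimal sketch of a possible guard, assuming the function follows the usual MoCo-style all-gather (the body below is illustrative, not the repo's exact code):

```python
import torch

@torch.no_grad()
def concat_all_gather(tensor):
    # Single-process case: no default process group was initialized,
    # so there is nothing to gather; return the local batch as-is.
    if not (torch.distributed.is_available() and torch.distributed.is_initialized()):
        return tensor
    # Multi-GPU case: gather the key batches from every rank and concatenate.
    tensors_gather = [torch.ones_like(tensor)
                      for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)
```

Alternatively, as the error message itself suggests, a one-process group could be initialized before training so the existing gather path works unchanged. A sketch (the address and port values here are arbitrary choices, not from the repo):

```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)
```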