DistributedDataParallel / Distributed training
How to debug a model that is not training under DDP
- Verify that gradient norms match across ranks:
```python
import torch
import torch.distributed


def is_distributed():
    return torch.distributed.is_available() and torch.distributed.is_initialized()


def print0(*args, **kwargs):
    # Print only on rank 0 to avoid duplicated output across processes
    if not is_distributed() or torch.distributed.get_rank() == 0:
        print(*args, **kwargs)


def verify_grad_sync(model):
    world_size = torch.distributed.get_world_size() if is_distributed() else 1
    for network, network_name in (
        (model, "your model name"),
        # Add more entries for use cases with GANs / multiple networks
        # wrapped in separate DDPs
    ):
        print("Checking gradients for", network_name)
        # Unwrap the DDP container to reach the underlying module
        module = network.module if hasattr(network, "module") else network
        for name, param in module.named_parameters():
            if param.grad is None:
                continue
            local_norm = param.grad.norm()
            tensor = local_norm.clone().detach()
            if is_distributed():
                # Sum the per-rank norms, then divide to get the average
                torch.distributed.all_reduce(tensor, op=torch.distributed.ReduceOp.SUM)
                avg = tensor / world_size
            else:
                avg = local_norm
            print0(
                f"[{network_name}] {name} grad norm (per-rank) {local_norm:.6f}, "
                f"averaged {avg:.6f}"
            )


verify_grad_sync(model)
```
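A minimal sketch of where to hook the check into a training step (assuming the usual `zero_grad()` / `backward()` / `step()` loop; `dataloader`, `optimizer`, and `compute_loss` are hypothetical names):

```python
for batch in dataloader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)  # hypothetical loss computation
    loss.backward()
    # DDP averages gradients during backward(), so per-rank and averaged
    # norms should already agree at this point
    verify_grad_sync(model)
    optimizer.step()
```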
- If the averaged norm differs from the local one, the model is training in a different direction on each GPU = definitely does not work ;) Check the gradient flow and how each network gets updated. It's especially tricky with GANs and training loops where `requires_grad` needs to be toggled (or not); see the sketch below.
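For the GAN case, a minimal sketch of the usual freeze/unfreeze pattern (`generator`, `discriminator`, `adversarial_loss`, and the tensors are hypothetical names). Note that toggling `requires_grad` on a network after it has been wrapped in DDP can interact badly with DDP's gradient synchronization, so it's worth double-checking exactly where the toggling happens:

```python
def set_requires_grad(network, flag: bool):
    # Freeze or unfreeze every parameter of a network
    for param in network.parameters():
        param.requires_grad = flag


# Generator step: freeze the discriminator so only generator grads flow
set_requires_grad(discriminator, False)
g_loss = adversarial_loss(discriminator(generator(noise)), real_labels)
g_loss.backward()

# Discriminator step: unfreeze it again before computing its own grads
set_requires_grad(discriminator, True)
d_loss = adversarial_loss(discriminator(fake_images.detach()), fake_labels)
d_loss.backward()
```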
Datasets / Dataloaders
Error: `received 0 items of ancdata`
Occurs when setting `num_workers > 0` in `DataLoader` (e.g. on Azure VMs).
Fix: switch the sharing strategy to `file_system`:

```python
import torch.multiprocessing

# Use the file_system strategy instead of the default file_descriptor one
torch.multiprocessing.set_sharing_strategy('file_system')
```
Source: https://discuss.pytorch.org/t/runtimeerror-received-0-items-of-ancdata/4999/2
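The strategy is process-wide, so it's safest to set it before the `DataLoader` is created (a minimal sketch; `my_dataset` is a hypothetical dataset):

```python
import torch.multiprocessing
from torch.utils.data import DataLoader

torch.multiprocessing.set_sharing_strategy('file_system')

loader = DataLoader(my_dataset, batch_size=32, num_workers=4)
```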
Error with shared memory on Kubernetes
Error: `Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.`
Solution: mount a memory-backed volume at `/dev/shm` in the Pod / Job:
```yaml
spec:
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
  containers:
    - name: main  # specify your container name here
      image: image-name  # specify your image name here
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
```
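To confirm the mount took effect, you can check the shared memory size inside the running pod (`my-pod` is a hypothetical pod name):

```
kubectl exec -it my-pod -- df -h /dev/shm
```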
General
Check if CUDA is available

```python
import torch

torch.cuda.is_available()  # True if PyTorch can see a usable GPU
```
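If that returns `False`, a few more fields help tell a CPU-only build apart from a driver problem (a quick diagnostic sketch):

```python
import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version it was compiled against (None = CPU-only build)
print(torch.cuda.device_count())  # number of visible GPUs
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```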
GPU does not work / GPU not available
- Check the CUDA version (`nvcc --version`):

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
```
- Install PyTorch built against the matching CUDA version, e.g. for a `conda` installation:

```
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
```
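After installing, it's worth checking that the CUDA version PyTorch was built against matches what `nvcc` reports:

```
python -c "import torch; print(torch.version.cuda)"
```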