DistributedDataParallel / Distributed training

How to debug a model that is not training under DDP

  1. Verify the gradient norms across nodes:
import torch
import torch.distributed as dist


def print0(*args, **kwargs):
    # Print only on rank 0 to avoid duplicated log lines
    if not (dist.is_available() and dist.is_initialized()) or dist.get_rank() == 0:
        print(*args, **kwargs)


def verify_grad_sync(model):
    for network, network_name in (
        (model, "your model name"),
        # Add more (network, name) pairs for GANs / multiple networks
        # wrapped in separate DDPs
    ):
        print("Checking gradients for", network_name)
        # Unwrap the DDP container to reach the underlying module's parameters
        module = network.module if hasattr(network, "module") else network
        for name, param in module.named_parameters():
            if param.grad is None:
                continue
            local_norm = param.grad.norm()
            tensor = local_norm.clone().detach()

            if dist.is_available() and dist.is_initialized():
                # Sum the per-rank norms, then divide by world size to get the average
                dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
                avg = tensor / dist.get_world_size()
            else:
                avg = local_norm

            print0(
                f"[{network_name}] {name} grad norm (per-rank) {local_norm:.6f}, averaged {avg:.6f}"
            )


verify_grad_sync(model)
  2. If the averaged norm differs from the local one, the model is training in a different direction on each GPU, which definitely does not work ;) Check the gradient flow and how each network actually gets updated. This is especially tricky with GANs and training loops where requires_grad has to be toggled on and off (see the sketch below).
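
A minimal sketch of the requires_grad-toggling pattern that often hides such bugs; the generator, discriminator and optimizer names here are hypothetical stand-ins, not code from the snippet above. Forgetting to flip the flags back after the generator step silently stops the other network from training.

import torch
from torch import nn

# Toy stand-ins; in a real setup these are your (DDP-wrapped) networks
generator = nn.Linear(8, 8)
discriminator = nn.Linear(8, 1)
g_opt = torch.optim.SGD(generator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()


def set_requires_grad(net, flag):
    for p in net.parameters():
        p.requires_grad_(flag)


# Generator step: freeze D so only G accumulates gradients for this pass
set_requires_grad(discriminator, False)
g_opt.zero_grad()
z = torch.randn(4, 8)
g_loss = loss_fn(discriminator(generator(z)), torch.ones(4, 1))
g_loss.backward()
g_opt.step()

# Forgetting to re-enable this is a classic source of "not training" bugs
set_requires_grad(discriminator, True)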

Datasets / Dataloaders

Error: received 0 items of ancdata

Occurs when setting num_workers > 0 in DataLoader (e.g., on Azure VMs).

# Fix: switch the DataLoader workers' sharing strategy
torch.multiprocessing.set_sharing_strategy('file_system')

Source: https://discuss.pytorch.org/t/runtimeerror-received-0-items-of-ancdata/4999/2
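
A minimal sketch of where the call usually lives: at the top of the training script, before any DataLoader with num_workers > 0 is constructed (the dataset below is just a placeholder):

import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# Switch worker IPC from file descriptors to the file system
torch.multiprocessing.set_sharing_strategy("file_system")

dataset = TensorDataset(torch.randn(100, 3))  # placeholder dataset
loader = DataLoader(dataset, batch_size=8, num_workers=4)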

Error with shared memory on Kubernetes

Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

Solution: mount a memory-backed volume at /dev/shm in the Pod / Job spec:

spec:
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
  containers:
  - image: image-name  # specify your image name here
    volumeMounts:
      - mountPath: /dev/shm
        name: dshm
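
To confirm from inside the running container that the mount took effect, one option (a sketch using only the Python standard library) is:

import shutil

total, _, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total {total / 2**30:.1f} GiB, free {free / 2**30:.1f} GiB")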

General

Check if CUDA is enabled

import torch
torch.cuda.is_available()
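
If this returns False, a slightly more verbose check (all standard torch calls) helps narrow down whether the build is CPU-only or simply sees no devices:

import torch

print("CUDA available:", torch.cuda.is_available())
print("Built with CUDA:", torch.version.cuda)  # None means a CPU-only build
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))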

GPU does not work / GPU not available

  1. Check CUDA version (nvcc --version)
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2018 NVIDIA Corporation
    Built on Sat_Aug_25_21:08:01_CDT_2018
    Cuda compilation tools, release 10.0, V10.0.130
    
  2. Install PyTorch built for the matching CUDA version, e.g., for a conda installation: conda install pytorch torchvision cudatoolkit=10.0 -c pytorch (verification sketch below)
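
After reinstalling, a quick sanity check that the installed wheel matches the toolkit reported by nvcc (standard torch attributes):

import torch

print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)  # should match the nvcc release, e.g. 10.0
print("cuDNN:", torch.backends.cudnn.version())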