Typical Mixed Precision Training
Working with Unscaled Gradients
Working with Multiple Models, Losses, and Optimizers
DataParallel in a single process
DistributedDataParallel, one GPU per process
DistributedDataParallel, multiple GPUs per process
Autocast and Custom Autograd Functions