Mixed precision training
In Colossal-AI, we have incorporated different implementations of mixed precision training:

1. torch amp
2. apex amp
3. naive amp
Among these methods, apex amp is not compatible with tensor parallelism. This is because tensors are split across devices in tensor parallelism, so the processes must communicate with each other to check whether inf or nan occurs anywhere in the model weights. We modified the torch amp implementation so that it is now compatible with tensor parallelism.
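To illustrate why this check requires communication, the sketch below ORs a per-rank overflow flag across processes with a MAX all-reduce. This is only an illustration of the idea; the function and the name found_inf are ours, not part of Colossal-AI.

import torch
import torch.distributed as dist

def global_overflow_check(local_params):
    # In tensor parallelism each rank only holds a shard of the weights,
    # so it can only inspect its local shard for inf/nan.
    found_inf = torch.zeros(1, device='cuda')
    for param in local_params:
        if not torch.isfinite(param).all():
            found_inf.fill_(1.0)
            break

    # A MAX all-reduce acts as a logical OR across ranks: if any rank
    # saw inf/nan, every rank ends up with found_inf == 1.
    dist.all_reduce(found_inf, op=dist.ReduceOp.MAX)
    return bool(found_inf.item())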
To use mixed precision training, you can simply set the fp16 field in the config file. Currently, torch and apex amp are not guaranteed to work with tensor and pipeline parallelism. We recommend you use torch amp, as it generally gives better accuracy than naive amp.
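As a rough sketch of the overall flow, a config file enabling torch amp and the matching training script might look as follows. The exact signatures of colossalai.launch_from_torch and colossalai.initialize may differ between versions, so treat this as an assumption rather than a definitive recipe; model, optimizer, criterion and train_dataloader are assumed to be defined elsewhere.

# config.py -- enabling mixed precision is just a matter of the fp16 field
from colossalai.amp import AMP_TYPE

fp16 = dict(mode=AMP_TYPE.TORCH)

# train.py (sketch) -- AMP is then applied automatically during initialization
import colossalai

colossalai.launch_from_torch(config='./config.py')
engine, train_dataloader, *_ = colossalai.initialize(model,
                                                     optimizer,
                                                     criterion,
                                                     train_dataloader)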
The AMP module is designed to be completely modular and can be used independently of other Colossal-AI modules. If you wish to only use amp in your code base without colossalai.initialize, you can convert your model, optimizer and criterion directly:

import colossalai
from colossalai.amp import AMP_TYPE

# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
                                                            optimizer,
                                                            criterion,
                                                            AMP_TYPE.TORCH)
Torch AMP

PyTorch provides mixed precision training from version 1.6 onwards. It provides an easy way to cast data to fp16 while keeping some operations such as reductions in fp32. You can configure the gradient scaler in the config file.
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,

    # below are default values for grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
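These fields are the constructor arguments of PyTorch's torch.cuda.amp.GradScaler. For reference, the behaviour this mode automates corresponds roughly to the following plain PyTorch loop (a sketch; model, optimizer, criterion and train_dataloader are assumed to exist and live on the GPU):

import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler(init_scale=2.**16,
                    growth_factor=2.0,
                    backoff_factor=0.5,
                    growth_interval=2000,
                    enabled=True)

for data, target in train_dataloader:
    optimizer.zero_grad()
    # the forward pass runs selected ops in fp16, reductions stay in fp32
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    # scale the loss to avoid fp16 gradient underflow; scaler.step
    # unscales the gradients and skips the update on overflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()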
Apex AMP

For this mode, we rely on the Apex implementation of mixed precision training. We support this plugin because it allows finer control over the granularity of mixed precision. For example, the O2 level (optimization level 2) will keep batch normalization in fp32.

The following code block shows a config file for Apex AMP.
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,

    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
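For reference, these fields are forwarded to apex.amp.initialize. In hand-written Apex code, the equivalent O1 setup looks roughly like the sketch below (model, optimizer, criterion and train_dataloader are assumed to exist):

from apex import amp

# wrap the model and optimizer; O1 patches torch functions so that
# whitelisted ops run in fp16 while blacklisted ones stay in fp32
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

for data, target in train_dataloader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    # scale_loss applies dynamic loss scaling before the backward pass
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()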
Naive AMP

We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism. This AMP mode casts all operations into fp16. The following code block shows a config file for this mode.
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,

    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
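The scaling fields implement the usual dynamic loss scaling scheme: the scale is multiplied by backoff_factor once hysteresis overflows have been observed, and multiplied by growth_factor after growth_interval consecutive overflow-free steps, never dropping below min_scale. The toy class below sketches that logic; it is our illustration of the scheme, not Colossal-AI's actual implementation.

class DynamicLossScaler:
    """Toy dynamic loss scaler mirroring the config fields above."""

    def __init__(self, initial_scale=2 ** 32, min_scale=1, growth_factor=2,
                 backoff_factor=0.5, growth_interval=1000, hysteresis=2):
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._growth_tracker = 0
        self._hysteresis_tracker = hysteresis

    def update(self, overflow):
        if overflow:
            # only back off after `hysteresis` overflows, tolerating
            # occasional spikes without shrinking the scale immediately
            self._growth_tracker = 0
            self._hysteresis_tracker -= 1
            if self._hysteresis_tracker <= 0:
                self.scale = max(self.scale * self.backoff_factor,
                                 self.min_scale)
        else:
            # grow the scale after enough consecutive clean steps
            self._growth_tracker += 1
            if self._growth_tracker >= self.growth_interval:
                self._growth_tracker = 0
                self._hysteresis_tracker = self.hysteresis
                self.scale *= self.growth_factor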