colossalai.initialize

colossalai.initialize.get_default_parser()

Reads the user command line and uses an argument parser to parse the input arguments. The input arguments include the configuration, host, port, world size, local rank, and backend for torch.distributed.

Returns

Returns the parser with the default arguments; the user may add customized arguments to this parser.

Return type

Namespace
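
A minimal usage sketch; the --epochs argument is a hypothetical addition on top of the defaults:

    import colossalai

    # Parser pre-populated with the default arguments:
    # config, host, port, world_size, local_rank, backend.
    parser = colossalai.get_default_parser()

    # Hypothetical custom argument added by the user.
    parser.add_argument('--epochs', type=int, default=10)

    args = parser.parse_args()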

colossalai.initialize.launch(config, rank, world_size, host, port, backend='nccl', local_rank=None, seed=1024, verbose=True)

This function first parses the configuration arguments, using parse_args() in case one of the input arguments is not given. It then initializes and sets the distributed environment by calling the global context's functions.

Parameters
  • config (Union[str, dict, Config]) – a config dict, a Config object, or a config file path are all acceptable

  • rank (int) – rank for the default process group

  • world_size (int) – world size of the default process group

  • host (str) – the master address for distributed training

  • port (str) – the master port for distributed training

  • backend (str) – backend for torch.distributed

  • local_rank (int, optional) – rank of the process on the node, used to set the default CUDA device; defaults to None. If local_rank is None, the default device ordinal will be calculated automatically.

  • seed (int, optional) – specified random seed for every process, defaults to 1024

  • verbose (bool) – whether to print logs

Raises

Exception – raised when the config type is wrong
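
A minimal sketch of launching process 0 of a single-node, 4-process group by hand; the config path and address are placeholders, and in practice rank and local_rank are usually supplied per process by a launcher:

    import colossalai

    # Manually launch the first process of a 4-process group on the local machine.
    colossalai.launch(
        config='./config.py',   # placeholder config file path
        rank=0,
        world_size=4,
        host='127.0.0.1',
        port=29500,
        backend='nccl',
        local_rank=0,
    )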

colossalai.initialize.launch_from_slurm(config, host, port, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for the SLURM launcher; rank and world size are read from the environment variables set by SLURM.

Parameters
  • config (Union[str, dict, Config]) – a config dict, a Config object, or a config file path are all acceptable

  • host (str) – the master address for distributed training

  • port (str) – the master port for distributed training

  • backend (str) – backend for torch.distributed

  • verbose (bool) – whether to print logs
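
A minimal sketch for a script started with srun; rank and world size come from SLURM's environment, so only the master address and port are passed explicitly:

    import colossalai

    # Parse --config, --host and --port from the command line.
    parser = colossalai.get_default_parser()
    args = parser.parse_args()

    # Rank and world size are taken from the SLURM environment variables.
    colossalai.launch_from_slurm(
        config=args.config,
        host=args.host,
        port=args.port,
    )

The script would then be started through SLURM, e.g. srun python train.py --host <master node> --port 29500 --config ./config.py (illustrative command).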

colossalai.initialize.launch_from_openmpi(config, host, port, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for the OpenMPI launcher; rank and world size are read from the environment variables set by OpenMPI.

Parameters
  • config (Union[str, dict, Config]) – a config dict, a Config object, or a config file path are all acceptable

  • host (str) – the master address for distributed training

  • port (str) – the master port for distributed training

  • backend (str) – backend for torch.distributed

  • verbose (bool) – whether to print logs
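
The OpenMPI case looks the same, except the script is started with mpirun; a minimal sketch:

    import colossalai

    parser = colossalai.get_default_parser()
    args = parser.parse_args()

    # Rank and world size are taken from the environment variables set by OpenMPI.
    colossalai.launch_from_openmpi(
        config=args.config,
        host=args.host,
        port=args.port,
    )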

colossalai.initialize.launch_from_torch(config, backend='nccl', seed=1024, verbose=True)

A wrapper of colossalai.launch for torchrun or torch.distributed.launch; rank, world size, host, and port are read from the environment variables set by PyTorch.

Parameters
  • config (Union[str, dict, Config]) – a config dict, a Config object, or a config file path are all acceptable

  • backend (str) – backend for torch.distributed

  • verbose (bool) – whether to print logs
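
A minimal sketch for a script started with torchrun (or python -m torch.distributed.launch); the config path is a placeholder:

    import colossalai

    # RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are read from
    # the environment variables set by the PyTorch launcher, e.g.
    #   torchrun --nproc_per_node 4 train.py
    colossalai.launch_from_torch(config='./config.py')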

colossalai.initialize.initialize(model, optimizer, criterion, train_dataloader=None, test_dataloader=None, lr_scheduler=None, verbose=True)

Core function that wraps the essential training components with our functionality, based on the config loaded into gpc.config.

Parameters
  • model (torch.nn.Module) – your model instance

  • optimizer (torch.optim.optimizer.Optimizer) – your optimizer instance

  • criterion (torch.nn.modules.loss._Loss) – your criterion instance

  • train_dataloader (torch.utils.data.DataLoader) – dataloader for training data

  • test_dataloader (torch.utils.data.DataLoader) – dataloader for testing data

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler) – your lr scheduler instance

  • verbose (bool) – whether to print logs

Returns

(engine, train_dataloader, test_dataloader, lr_scheduler)

Return type

tuple
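
A minimal end-to-end sketch, assuming the distributed environment has already been set up with one of the launch functions above; the toy model, data, and hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    import colossalai

    # The distributed environment must already be initialized, e.g.:
    # colossalai.launch_from_torch(config='./config.py')

    # Placeholder components; in a real script these come from your own code.
    model = nn.Linear(32, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    dataset = TensorDataset(torch.randn(128, 32), torch.randint(0, 10, (128,)))
    train_loader = DataLoader(dataset, batch_size=16)

    # Wrap everything into an engine according to gpc.config.
    engine, train_loader, test_loader, lr_scheduler = colossalai.initialize(
        model=model,
        optimizer=optimizer,
        criterion=criterion,
        train_dataloader=train_loader,
    )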