colossalai.utils.checkpoint(function, *args)

Checkpoint the computation while preserving the RNG states; modified from PyTorch's torch.utils.checkpoint.

  • function – the forward-pass function. It should know how to handle the input tuple.

  • args – tuple containing inputs to the function


Output of running function on *args
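To make the idea concrete, here is a pure-Python sketch of what activation checkpointing does conceptually (this is not the ColossalAI implementation, which operates on torch tensors and autograd): only the inputs are saved, and the forward pass is recomputed when its intermediate results are needed again.

```python
# Conceptual sketch of activation checkpointing (not the real ColossalAI API):
# instead of storing every intermediate activation for the backward pass,
# store only the inputs and recompute the forward pass on demand.

def forward(x):
    # stand-in for an expensive chain of operations
    return (x * 2 + 1) ** 2

def checkpointed(function, *args):
    """Run `function` but keep only `args`; recompute when needed."""
    saved_inputs = args          # cheap to store
    result = function(*args)     # intermediate activations are NOT kept

    def recompute():
        # during the backward pass, rerun the forward to regain activations
        return function(*saved_inputs)

    return result, recompute

out, recompute = checkpointed(forward, 3)
assert recompute() == out        # recomputation reproduces the output
```

This trades compute for memory: the forward runs twice, but no activations are held between passes.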

colossalai.utils.print_rank_0(msg, logger=None)

Print a message and optionally save it to a log. This is executed only on the rank-0 GPU.

  • msg – A str message to output

  • logger – python logger object, defaults to None
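A minimal sketch of the rank-0-only pattern. In the real function the rank comes from the distributed runtime; here it is passed explicitly so the sketch is self-contained, and the function name is illustrative.

```python
# Sketch of rank-0-only printing; `rank` is passed in explicitly here,
# whereas the real implementation queries the distributed environment.

def print_rank_0_sketch(msg, rank, logger=None):
    """Emit `msg` only on rank 0; route through `logger` when one is given."""
    if rank != 0:
        return None              # every other rank stays silent
    if logger is not None:
        logger.info(msg)
    else:
        print(msg)
    return msg

print_rank_0_sketch("training started", rank=0)   # printed once
print_rank_0_sketch("training started", rank=3)   # silently skipped
```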


Make sure model parameters are consistent across processes in data-parallel mode.


model – a PyTorch nn.Module whose parameters are checked for consistency

colossalai.utils.clip_grad_norm_fp32(parameters, max_norm, norm_type=2)

Clips the gradient norm of an iterable of parameters whose gradients are in fp32.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_ with added functionality to handle model-parallel parameters. Note that the gradients are modified in place.

  • parameters ((Iterable[Tensor] or Tensor)) – an iterable of Tensors or a single Tensor that will have gradients normalized

  • max_norm (float or int) – max norm of the gradients

  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.


Total norm of the parameters (viewed as a single vector).

Return type
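The clipping logic itself is simple; here is a pure-Python sketch of it on plain floats (the real function works on torch tensors and additionally reduces the norm across model-parallel ranks). The function name is illustrative.

```python
import math

# Sketch of gradient-norm clipping: compute the total p-norm of all
# gradients, then scale them in place if it exceeds max_norm.

def clip_grad_norm_sketch(grads, max_norm, norm_type=2):
    """Scale `grads` in place so their total p-norm is at most `max_norm`."""
    if norm_type == math.inf:
        total_norm = max(abs(g) for g in grads)
    else:
        total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)   # epsilon avoids division by zero
    if clip_coef < 1:
        for i in range(len(grads)):
            grads[i] *= clip_coef
    return total_norm

grads = [3.0, 4.0]                               # L2 norm is 5.0
norm = clip_grad_norm_sketch(grads, max_norm=1.0)
```

After the call, `grads` has been rescaled so its L2 norm is (approximately) 1.0, and the returned value is the pre-clipping norm.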



Returns the index of the currently selected device (GPU/CPU).


Similar to torch.cuda.synchronize(). Waits for all kernels in all streams on a CUDA device to complete.


Similar to torch.cuda.empty_cache(). Releases all unoccupied cached memory currently held by the caching allocator.


Send model(s) to the GPU.


models – an nn.Module or a list of modules

colossalai.utils.report_memory_usage(message, logger=None, report_cpu=False)

Calculate and print RAM usage (in GB)


EnvironmentError – raised if no distributed environment has been initialized

class colossalai.utils.Timer

A timer object that helps to log execution times and provides different tools to assess them.


First synchronize CUDA, then reset the clock and start the timer.


Stop the timer and record the start-stop time interval.

  • keep_in_history (bool, optional) – whether to record this start-stop interval into the history, defaults to False


The start-stop interval (int).


The mean of all start-stop time intervals in the history (int).


The sum of all start-stop time intervals in the history (int).


Return the last start-stop time interval (int). Use it only when the timer is not running.


Clear the timer and its history.
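A minimal sketch of such a timer using wall-clock timing via time.perf_counter (the real class also synchronizes CUDA before starting and stopping). The class name is illustrative.

```python
import time

# Sketch of a Timer with optional history, built on time.perf_counter.

class TimerSketch:
    def __init__(self):
        self._start = None
        self.history = []

    def start(self):
        self._start = time.perf_counter()

    def stop(self, keep_in_history=False):
        interval = time.perf_counter() - self._start
        self._start = None
        if keep_in_history:
            self.history.append(interval)
        return interval

    def get_history_mean(self):
        return sum(self.history) / len(self.history)

    def get_history_sum(self):
        return sum(self.history)

    def reset(self):
        self.history.clear()
        self._start = None

t = TimerSketch()
t.start()
elapsed = t.stop(keep_in_history=True)
```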

class colossalai.utils.MultiTimer(on=True)

An object that contains multiple timers.


on (bool) – whether the timer is enabled. Default is True


Start the named timer.

  • name (str) – the timer's key

stop(name, keep_in_history)

Stop the named timer.

  • name (str) – the timer's key

  • keep_in_history (bool) – whether to record this start-stop interval into the history


Get a timer by its name.

  • name (str) – the timer's key


The Timer with the given name.


Reset the timers.

  • name (str, optional) – if a name is given, only the named timer is reset and the others are left untouched; defaults to None
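A sketch of the multi-timer pattern as a name-to-timer mapping, with the `on` flag gating all timing. Names and internals here are illustrative, not the ColossalAI implementation.

```python
import time

# Sketch of a MultiTimer: a dict of named timers; `on=False` disables timing.

class MultiTimerSketch:
    def __init__(self, on=True):
        self.on = on
        self._starts = {}          # name -> start timestamp
        self._history = {}         # name -> list of recorded intervals

    def start(self, name):
        if self.on:
            self._starts[name] = time.perf_counter()

    def stop(self, name, keep_in_history=False):
        if not self.on:
            return None
        interval = time.perf_counter() - self._starts.pop(name)
        if keep_in_history:
            self._history.setdefault(name, []).append(interval)
        return interval

mt = MultiTimerSketch()
mt.start("forward")
fwd = mt.stop("forward", keep_in_history=True)
```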

colossalai.utils.accumulate_gradient(model, optimizer, dataloader, accumulate_size, gradient_handlers=None, lr_scheduler=None)

Enable gradient accumulation so that gradients are accumulated over accumulate_size steps before each optimizer update.
  • model (torch.nn.Module) – your model object

  • optimizer (torch.optim.Optimizer) – your optimizer object

  • dataloader (Iterable) – your dataloader object

  • accumulate_size (int) – the number of steps to accumulate gradients

  • gradient_handlers (List[colossalai.engine.BaseGradientHandler]) – list of gradient handler objects. Default is None

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler) – your lr scheduler object. Default is None
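A toy sketch of the accumulation pattern itself: the optimizer steps only once every `accumulate_size` micro-batches, applying the averaged buffered gradient. All names here are illustrative, not the ColossalAI API.

```python
# Sketch of gradient accumulation on a stream of scalar "gradients":
# buffer them and apply one averaged update every accumulate_size steps.

def train_with_accumulation(grad_stream, accumulate_size):
    """Return the effective gradient applied at each optimizer step."""
    applied = []
    buffer = 0.0
    for i, g in enumerate(grad_stream, start=1):
        buffer += g                                    # accumulate, don't step
        if i % accumulate_size == 0:
            applied.append(buffer / accumulate_size)   # averaged update
            buffer = 0.0                               # zero the buffer
    return applied

steps = train_with_accumulation([1.0, 3.0, 2.0, 2.0], accumulate_size=2)
```

This simulates a larger effective batch size without increasing peak memory: four micro-batches produce two optimizer steps.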

class colossalai.utils.DataParallelSampler(dataset, shuffle=False, seed=0, drop_last=False)

A data sampler for distributed data parallelism

  • dataset – a Dataset instance

  • shuffle (bool, optional) – whether to shuffle data, defaults to False

  • seed (int, optional) – the random seed, defaults to 0

  • drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller, defaults to False


Sets the epoch for this sampler. When shuffle=True, this ensures all replicas use a different random ordering for each epoch. Otherwise, the next iteration of this sampler will yield the same ordering.


epoch (int) – Epoch number.
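To illustrate what a distributed sampler does, here is a sketch of partitioning a dataset across ranks with the epoch folded into the shuffle seed, which is what set_epoch enables: every rank shuffles identically and then takes its own slice. The function name is illustrative.

```python
import random

# Sketch of distributed sampling: shuffle with a seed derived from
# (seed + epoch) so all ranks agree on the ordering, then slice per rank.

def partition(dataset_size, num_replicas, rank, shuffle, seed, epoch):
    indices = list(range(dataset_size))
    if shuffle:
        # same seed+epoch on every rank => identical ordering to slice from
        random.Random(seed + epoch).shuffle(indices)
    return indices[rank::num_replicas]   # round-robin slice for this rank

shards = [partition(8, num_replicas=2, rank=r, shuffle=True, seed=0, epoch=1)
          for r in range(2)]
```

The shards are disjoint and together cover the whole dataset, and changing the epoch changes the ordering while keeping ranks in agreement.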

colossalai.utils.get_dataloader(dataset, shuffle=False, seed=1024, add_sampler=True, drop_last=False, pin_memory=False, num_workers=0, **kwargs)

Set up a deterministic dataloader (it also configures worker seeding, the sampler, and whether to shuffle).

  • dataset – a dataset

  • shuffle (bool, optional. Default is False) – whether to shuffle the dataset

  • seed (int, optional. Default is 1024) – random worker seed, defaults to 1024

  • add_sampler (bool, optional. Default is True) – whether to add a DataParallelSampler to the dataset

  • drop_last (bool, optional. Default is False) – drop the last incomplete batch of data

  • pin_memory (bool, optional. Default is False) – whether to pin memory address in CPU memory

  • num_workers (int, optional. Default is 0) – number of worker processes for this dataloader


A DataLoader object

Return type