
Lazy initialization

Author: Hongxin Liu

Prerequisite:

Introduction

Lazy initialization defers model initialization. It saves memory when initializing large models.

If your model has N billion parameters and your memory (or GPU memory) is M GB, we recommend you use lazy initialization when 4N >= M. Otherwise, it is optional.
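To make the rule of thumb concrete, here is a small sketch; the helper name and the numbers are illustrative and not part of the ColossalAI API.

def should_use_lazy_init(n_params_billion: float, mem_gb: float) -> bool:
    # An FP32 parameter takes 4 bytes, so N billion parameters occupy about 4N GB,
    # which is what the 4N >= M rule of thumb compares against the available memory M.
    return 4 * n_params_billion >= mem_gb

# A 7B-parameter model on a 24 GB GPU: 4 * 7 = 28 >= 24, so lazy initialization is recommended.
print(should_use_lazy_init(7, 24))  # True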

Usage

Lazy initialization must be used with booster.

API reference

class colossalai.lazy.LazyInitContext(tensor_cls: Union[_MyTensor, LazyTensor] = LazyTensor, default_device: Optional[Union[str, torch.device, int]] = None)
Parameters
  • tensor_cls (Union[_MyTensor, LazyTensor], optional) -- For testing only. Defaults to LazyTensor.
  • default_device (Optional[Union[str, torch.device, int]], optional) -- Default device for initialization. If it is "cuda", initialization is accelerated, but CUDA memory is allocated. Defaults to None (CPU).
Description
Context manager for lazy initialization. Enables initializing the model without allocating real memory.
function materialize(module: Module, verbose: bool = False)
Parameters
  • module (nn.Module) -- Target nn.Module.
  • verbose (bool) -- Whether to print the lazy initialization rate. Defaults to False.
Description
Initialize all `Parameter` objects from `LazyTensor`. This function modifies the module in place.
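Below is a minimal sketch of calling materialize directly. Normally booster.boost materializes the model for you; the plain nn.Linear module and the assumption that materialize is reachable as LazyInitContext.materialize are for illustration only.

import torch.nn as nn
from colossalai.lazy import LazyInitContext

with LazyInitContext():
    model = nn.Linear(1024, 1024)  # parameters are created lazily; no real memory is allocated yet

# Allocate the real tensors in place; verbose=True prints the lazy initialization rate.
LazyInitContext.materialize(model, verbose=True)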

Example

import colossalai
from colossalai.lazy import LazyInitContext
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

from transformers import LlamaForCausalLM, LlamaConfig, BertForPreTraining

colossalai.launch({})
plugin = GeminiPlugin()
booster = Booster(plugin=plugin)

# 1. Initialize model from scratch
# Initialization on cuda will accelerate the initialization process but take more GPU memory.
with LazyInitContext(default_device="cuda"):
    model = LlamaForCausalLM(LlamaConfig(hidden_size=64, intermediate_size=172, num_hidden_layers=4, num_attention_heads=4))
model, *_ = booster.boost(model)

# 2. Initialize model from pretrained
with LazyInitContext():
    model = BertForPreTraining.from_pretrained("prajjwal1/bert-tiny")
model, *_ = booster.boost(model)

⚠️ Lazy initialization from pretrained weights is supported for colossalai>0.3.3 or the main branch.

Limitations

As noted above, lazy initialization must be used with booster, and only some plugins support it.

| Plugin | Supported | Remarks |
| --- | --- | --- |
| Gemini | Yes | |
| Hybrid Parallel | Yes | |
| Low Level Zero | No | No need |
| Torch DDP | No | Incompatible |
| Torch FSDP | No | Incompatible |

Not all models can be fully lazily initialized. In some cases, some parameters/buffers are initialized eagerly, but this part usually accounts for a small proportion of the whole model.

Some models are not supported at all and will raise an error. We tested models from torchvision, diffusers, timm, transformers, torchaudio, and torchrec. The models below are not supported:

| Model | Category |
| --- | --- |
| wav2vec2_base | torchaudio |
| hubert_base | torchaudio |
| ViTModel | transformers |
| ViTForMaskedImageModeling | transformers |
| ViTForImageClassification | transformers |
| Blip2Model | transformers |
| Blip2ForConditionalGeneration | transformers |