ZeRO

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.

  1. ZeRO Stage 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.

  2. ZeRO Stage 2: The reduced 16-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.

  3. ZeRO Stage 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.

In addition, ZeRO-3 includes the infinity offload engine to form ZeRO-Infinity ([paper](https://arxiv.org/abs/2104.07857)), which can offload all model states to both CPU and NVMe memory for huge memory savings.

For a deep dive of our algorithms, please see our papers on ZeRO, ZeRO-Offload, and ZeRO-Infinity.

Note

DeepSpeed first included offloading capabilities with ZeRO-Offload, a system for offloading optimizer and gradient states to CPU memory within ZeRO-2. ZeRO-Infinity is the next generation of offloading capabilities, accessible to ZeRO-3. ZeRO-Infinity has all of the savings of ZeRO-Offload, plus is able to offload more the model weights and has more effective bandwidth utilization and overlapping of computation and communication.

Getting Started

If you are new to DeepSpeed, check out our Getting Started page.

Once you are training with DeepSpeed, enabling ZeRO-3 offload is as simple as enabling it in your DeepSpeed configuration! Below are a few examples of ZeRO-3 configurations. Please see our config guide for a complete list of options for configuration and performance tuning.

Note

ZeRO-Infinity and ZeRO-Offload work best with our heavily optimized deepspeed.ops.adam.DeepSpeedCPUAdam optimizer. We recommend using our optimizer config to instruct deepspeed.initialize() to build the optimizer for you.

ZeRO Configurations

All the settings for DeepSpeed ZeRO are set with the DeepSpeedZeroConfig. The dictionary provided under the zero_optimization entry of the main DeepSpeed configuration dict will be parsed and validated with this class. Sub-configurations for parameter offload and optimizer offload settings are parsed by DeepSpeedZeroOffloadParamConfig and DeepSpeedZeroOffloadOptimizerConfig.

class deepspeed.runtime.zero.config.DeepSpeedZeroConfig[source]

Sets parameters for ZeRO optimizations.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

stage: ZeroStageEnum = 0

Chooses different stages of ZeRO Optimizer. Stage 0, 1, 2, and 3 refer to disabled, optimizer state partitioning, and optimizer+gradient state partitioning, and optimizer+gradient+parameter partitioning, respectively.

contiguous_gradients: bool = True

Copies the gradients to a contiguous buffer as they are produced. Avoids memory fragmentation during backward pass.

reduce_scatter: bool = True

Uses reduce or reduce scatter instead of allreduce to average gradients

reduce_bucket_size: int = 500,000,000

Number of elements reduced/allreduced at a time. Limits the memory required for the allgather for large model sizes

Constraints
  • ge = 0

use_multi_rank_bucket_allreduce: bool = True

Combine the reduce buckets of the different ranks and do an All-Reduce instead of multiple Reduce ops. This feature is useful when the model is small and we want to scale it on too many GPUs which therefore reduces the message sizes of each packet.

allgather_partitions: bool = True

Chooses between allgather collective or a series of broadcast collectives to gather updated parameters from all the GPUs at the end of each step

allgather_bucket_size: int = 500,000,000

Number of elements allgathered at a time. Limits the memory required for the allgather for large model sizes

Constraints
  • ge = 0

overlap_comm: Optional[bool] = None

Attempts to overlap the reduction of the gradients with backward computation

load_from_fp32_weights: bool = True

Boolean indicating whether to initialize fp32 master weights from fp32 copies in checkpoint (no precision loss) or from model’s fp16 copies (with precision loss). This can be used to initialize optimizer state even when checkpoint is missing optimizer state.

elastic_checkpoint: bool = False

Enable loading checkpoint that was saved by job with different GPU count. No longer supported.

offload_param: Optional[DeepSpeedZeroOffloadParamConfig] = None

Enable offloading of model parameters to CPU or NVMe. This frees up GPU memory for larger models or batch sizes. Valid only with stage 3. Expects a dictionary containing values for DeepSpeedZeroOffloadParamConfig.

offload_optimizer: Optional[DeepSpeedZeroOffloadOptimizerConfig] = None

Enable offloading of optimizer state to CPU or NVMe, and optimizer computation to CPU. This frees up GPU memory for larger models or batch sizes. Valid for ZeRO stage 1, 2, 3. Expects a dictionary containing values for DeepSpeedZeroOffloadOptimizerConfig.

sub_group_size: int = 1,000,000,000

Tile size for parameter processing to fit massive models (with trillions of parameters). Used by ZeRO3-Offload and ZeRO-Infinity

Constraints
  • ge = 0

cpu_offload_param: Optional[bool] = None

Deprecated, please use offload_param

cpu_offload_use_pin_memory: Optional[bool] = None

Deprecated, please use offload_param or offload_optimizer

cpu_offload: Optional[bool] = None

Deprecated, please use offload_optimizer

prefetch_bucket_size: int = 50,000,000 (alias 'stage3_prefetch_bucket_size')

Maximum number of parameter elements to fetch ahead of use. Used by ZeRO3, ZeRO3-Offload, ZeRO-Infinity, and ZeRO-Inference.

Constraints
  • ge = 0

param_persistence_threshold: int = 100,000 (alias 'stage3_param_persistence_threshold')

Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages).

Constraints
  • ge = 0

model_persistence_threshold: int = sys.maxsize (alias 'stage3_model_persistence_threshold')

Maximum number of parameter elements that can be persisted in GPU and not partitioned. This imposes an upper bound on the number of unpartitioned parameters resulting from param_persistence_threshold setting. Used by ZeRO3-Offload, ZeRO-Infinity and ZeRO-Inference.

Constraints
  • ge = 0

max_live_parameters: int = 1,000,000,000 (alias 'stage3_max_live_parameters')

The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication.

Constraints
  • ge = 0

max_reuse_distance: int = 1,000,000,000 (alias 'stage3_max_reuse_distance')

Do not release a parameter if it will be reused within this threshold of parameters. Smaller values use less memory, but perform more communication.

Constraints
  • ge = 0

gather_16bit_weights_on_model_save: bool = False (alias 'stage3_gather_16bit_weights_on_model_save')

Consolidate the weights before saving the model by save_16bit_model(). Since the weights are partitioned across GPUs, they aren’t part of state_dict, so this function automatically gathers the weights when this option is enabled and then saves the fp16 model weights.

module_granularity_threshold: int = 0 (alias 'stage3_module_granularity_threshold')

The granularity of a module is determined by the ratio of “parameter_count / (1 + descendant count)”. ZeRO3 classifies modules with a granularity below the threshold as fine-grained, which are treated as integral units during parameter fetching. This reduces host overhead and the separate allgather overhead introduced by hooks for fine-grained layers when fetching parameters.

use_all_reduce_for_fetch_params: bool = False (alias 'stage3_use_all_reduce_for_fetch_params')

Use all_reduce op when fetching module parameters at stage3. This improves performance by reducing the overhead of concatenation and slicing on the host.

stage3_gather_fp16_weights_on_model_save: bool = False

Deprecated, please use gather_16bit_weights_on_model_save

ignore_unused_parameters: bool = True

Unused parameters in modules may be unexpected in static networks, but could be normal in dynamic networks. This controls whether or not training should terminate with an error message when unused parameters are detected. This is set to True by default, which means unused parameters are ignored and training continues. Now is just used in stage 2.

legacy_stage1: bool = False

For backward-compatibility enable old ZeRO stage 1 implementation. Use at your own risk, will be deprecated soon.

round_robin_gradients: bool = False

Stage 1 and 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. Performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism).

zero_hpz_partition_size: int = 1

Number of ranks in zero parameters partitioning secondary group

Constraints
  • ge = 0

zero_quantized_weights: bool = False

Boolean indicating whether to quantize zero parameters (weights) for efficient all_gather comm

zero_quantized_nontrainable_weights: bool = False

Boolean indicating whether to quantize non-trainable zero parameters (weights) for efficient memory usage and communication. Different from zero_quantized_weights that stores the weights in original precision and only perform quantization during communication, this flag will store the weights in quantized precision. This is useful for LoRA training.

zero_quantized_gradients: bool = False

Boolean indicating whether to use quantized zero gradients for efficient all_2_all_reduce comm

zeropp_loco_param: Optional[Dict[str, Any]] = None

This dictionary contains parameters for using LoCo-Zero++, with two key parameters: - err_beta: A coefficient for the moving average of quantization errors before and after gradient computation. It ranges between 0 and 1, with a default value of 0.8. - reset_T: The number of steps after which the moving-average error buffer is cleared. The default value is 1024. These parameters can be adjusted based on performance needs. Example configuration in ds config: “zeropp_loco_param”: { “err_beta”: 0.8, “reset_T”: 1024 }. See LoCo paper for more details: (https://arxiv.org/abs/2407.04480).

mics_shard_size: int = -1
mics_hierarchical_params_gather: bool = False
memory_efficient_linear: bool = True

Use memory efficient linear implementation, for Stage 3.

pipeline_loading_checkpoint: bool = False
override_module_apply: bool = True

Override nn.Module apply function, for Stage 3.

log_trace_cache_warnings: bool = False

Whether to log warnings from trace cache, such as invalidation events.

class deepspeed.runtime.zero.config.DeepSpeedZeroOffloadParamConfig[source]

Set options for parameter offload. Valid only with stage 3.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

device: OffloadDeviceEnum = 'none'

Device memory to offload model parameters. Supported options are cpu and nvme.

nvme_path: Optional[Path] = None

Filesystem path for NVMe device for parameter offloading.

buffer_count: int = 5

Number of buffers in buffer pool for parameter offloading to NVMe.

Constraints
  • ge = 0

buffer_size: int = 100,000,000

Size of buffers in buffer pool for parameter offloading to NVMe.

Constraints
  • ge = 0

max_in_cpu: int = 1,000,000,000

Number of parameter elements to maintain in CPU memory when offloading to NVMe is enabled.

Constraints
  • ge = 0

pin_memory: bool = False

Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead.

class deepspeed.runtime.zero.config.DeepSpeedZeroOffloadOptimizerConfig[source]

Set options for optimizer offload. Valid with stage 1, 2, and 3.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

device: OffloadDeviceEnum = 'none'

Device memory to offload optimizer state. Supported options are cpu and nvme. Optimizer computation is offload to CPU regardless of device option.

nvme_path: Optional[Path] = None

Filesystem path for NVMe device for optimizer state offloading.

buffer_count: int = 4

Number of buffers in buffer pool for optimizer state offloading to NVMe. This should be at least the number of states maintained per parameter by the optimizer. For example, Adam optimizer has 4 states (parameter, gradient, momentum, and variance).

Constraints
  • ge = 0

pin_memory: bool = False

Offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead.

pipeline_read: bool = False

For tile-based optimizer step processing, overlap read of next tile with computation of current tile. Used in ZeRO-Infinity.

pipeline_write: bool = False

For tile-based optimizer step processing, overlap write of previous tile with computation of current tile.

fast_init: bool = False

Enable fast optimizer initialization when offloading to NVMe.

ratio: float = 1.0

Percentage of offloaded optimizer states to CPU Adam. Only valid with ZeRO Stage 3.

Constraints
  • ge = 0.0

  • le = 1.0

Example ZeRO-3 Configurations

  1. Use ZeRO to partition the optimizer states (stage 1), gradients (stage 2), and parameters (stage 3).

    {
        "zero_optimization": {
            "stage": 3,
        },
        "fp16": {
            "enabled": true
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
            }
        },
        ...
    }
    
  2. Additionally offload the optimizer states and computations to the CPU with ZeRO-Infinity.

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu"
            }
        },
        ...
    }
    
  3. Save even more memory by offloading parameters to the CPU memory.

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu"
            }
            "offload_param": {
                "device": "cpu"
            }
        },
        ...
    }
    
  4. Save even MORE memory by offloading to NVMe (if available on your system):

    {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "nvme",
                "nvme_path": "/nvme_data"
            }
            "offload_param": {
                "device": "nvme",
                "nvme_path": "/nvme_data"
            }
        },
        ...
    }
    

MiCS Configurations

All MiCS configurations are set with DeepSpeedZeroConfig. MiCS assumes ZeRO stage 3 optimization is enabled. For now, there are two configuration fields of MiCS mics_shard_size and mics_hierarchical_params_gather. mics_shard_size controls how many devices are used for partitioning the model states. mics_hierarchical_params_gather controls whether we use a two-stage hierarchical way to gather parameters in the forward computation. mics_hierarchical_params_gather is useful when model states are partitioned across multiple nodes and the cross-node bandwidth is slow. By default this is turned off.

Example MiCS Configurations

  1. Use MiCS to partition the model states (including optimizer states, gradients, and parameters). The following config example partitions the model states to eight devices, and assumes the eight devices are located within a single node (mics_hierarchical_params_gather is False).

    {
        "zero_optimization": {
            "stage": 3,
            "mics_shard_size": 8,
            "mics_hierarchical_params_gather": False,
        },
        ...
    }
    

Assumptions

DeepSpeed automatically coordinates the collection (i.e., all-gather), partitioning (i.e., scatter), and offloading of parameters at the granularity of (sub)module forward() methods. The backward pass is handled similarly. This strategy has two underlying assumptions:

  1. The forward and backward passes of submodules must individually fit in device memory. If this not the case, deepspeed.zero.TiledLinear implements memory-centric tiling and works with ZeRO-3 to break linear layers into a sequence of smaller submodules that can fit in memory.

  2. A module’s parameters are only accessed within its own __init__ and forward() methods. Otherwise, DeepSpeed must be instructed to collect and re-partition the parameter. See Manual Parameter Coordination for manually coordinating parameters.

Constructing Massive Models

ZeRO-3 enables massive models whose parameters exceed the size of individual nodes in a system. For the typical case of training without model parallelism, you can simply allocate your model in our context:

with deepspeed.zero.Init():
    model = MyLargeModel()
class deepspeed.zero.Init(module=None, data_parallel_group=None, mem_efficient_linear=True, remote_device=None, pin_memory=False, config_dict_or_path=None, config=None, enabled=True, dtype=None, mpu=None, zero_param_parallel_group=None, zero_quantized_weights=False, zero_quantized_nontrainable_weights=False, sequence_data_parallel_group=None, param_swapper=None, tensor_overrides=[<DeepSpeedTensorOverride.dtype: 1>, <DeepSpeedTensorOverride.device: 2>])

A context to enable massive model construction for training with ZeRO-3. Models are automatically partitioned (or, sharded) across the system and converted to half precision.

Parameters
  • module (torch.nn.Module, optional) – If provided, partition the model as if it was constructed in the context.

  • data_parallel_group (deepspeed.comm process group, optional) – The group of processes to partition among. Defaults to all processes. Synonymous with sequence data parallel group for param partitioning across both sequence and data parallel groups.

  • mem_efficient_linear (bool, optional) – Replace torch.nn.functional.linear with an implementation that allows DeepSpeed to partition parameters. Defaults to True.

  • remote_device (string, optional) – The initial device to store model weights e.g., cpu, nvme. Passing "cpu" will create the model in CPU memory. The model may still be moved to GPU based on the offload settings for training. Defaults to param offload device if a config is defined, otherwise GPU.

  • pin_memory (bool, optional) – Potentially increase performance by using pinned memory for model weights. remote_device must be "cpu". Defaults to pin_memory value in config, otherwise False.

  • config_dict_or_path (dict or json file, optional) – If provided, provides configuration for swapping fp16 params to NVMe and other things like dtype.

  • config (dict or json file, optional) – Deprecated, use config_dict_or_path instead.

  • enabled (bool, optional) – If False, this context has no effect. Defaults to True.

  • dtype (dtype, optional) – Can be used to change the data type of the parameters. Supported options are torch.half and torch.float. Defaults to None

  • mpu (object, optional) – A model parallelism unit object that implements get_{model,data}_parallel_{rank,group,world_size}.

  • zero_param_parallel_group (object, optional) – Parallel (comm) group for dual partitioning of ZeRO params.

  • zero_quantized_weights (bool, optional) – If True, turn on quantized weights in all gather weights. Default is False

  • zero_quantized_nontrainable_weights (bool, optional) – If True, nontrainable weights will be stored in quantized format. Default is False

  • param_swapper (deepspeed.runtime.swap_tensor.partitioned_param_swapper.AsyncPartitionedParameterSwapper, optional) – [Experimental] Use existing parameter swapper. Defaults to None. This argument will be removed in the near future.

  • tensor_overrides ([deepspeed.runtime.zero.DeepSpeedTensorOverride], optional) – Tensor attributes to override. Defaults to overriding dtype and device.

This context accelerates model initialization and enables models that are too large to allocate in their entirety in CPU memory. It has the following effects:

  1. allocates tensors to either GPU or CPU memory or NVMe

  2. converts floating point tensors to half precision

  3. immediately partitions tensors among the group of data-parallel devices

  4. (optional) replaces torch.nn.functional.linear with a more memory-efficient implementation

These modifications allow for models that exceed the size of local CPU/GPU memory/NVMe, but fit within the total NVMe capacity (i.e., aggregate CPU or GPU memory or NVMe) across all nodes. Consider initializing a model with one trillion parameters, whose weights occupy two terabytes (TB) in half precision. The initial CPU allocation in full precision requires 4TB of memory per process, and so a system with 8 GPUs per node would need 32TB of CPU memory due to data-parallel redundancies. Instead, by immediately partitioning tensors we remove the redundancies. The result is that regardless of the number of GPUs, we still only require the original 4TB. This allows for a linear increase in model size with the aggregate system memory. For example, if a node has 1TB of memory and 8 GPUs, we could fit a trillion parameter model with 4 nodes and 32 GPUs.

Important: If the fp16 weights of the model can’t fit onto a single GPU memory this feature must be used.

Note

Initializes deepspeed.comm if it has not already been done so. See deepspeed.init_distributed() for more information.

Note

Only applicable to training with ZeRO-3.

Examples

  1. Allocate a model and partition it among all processes:

    with deepspeed.zero.Init():
        model = MyLargeModel()
    
  2. Allocate a model in pinned CPU memory and partition it among a subgroup of processes:

    with deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(),
                             remote_device="cpu",
                             pin_memory=True):
        model = MyLargeModel()
    
  3. Partition an already-allocated model in CPU memory:

    model = deepspeed.zero.Init(module=model)
    
get_partition_rank()

subclass can overload to specify different relative rank in parameter partition group

get_dp_process_group()

Return the communication group with all data-parallel ranks

Manual Parameter Coordination

Most models require no modification to be trained with ZeRO-3. However, in some cases one may need to access model weights outside of the training loop, or to share weights across submodules during training. DeepSpeed has several mechanisms to coordinate partitioned weights for ZeRO-3.

Gathering Parameters

DeepSpeed provides mechanisms for collecting (or gathering) a partitioned parameter.

Some models partitioned with deepspeed.zero.Init may need to access a module’s weights outside of the class constructor or its forward() method. We refer to these weights as external parameters, since these parameters are accessed outside of the module that created them. To do so, use deepspeed.zero.GatheredParameters or deepspeed.zero.register_external_parameter().

class deepspeed.zero.GatheredParameters(params, modifier_rank=None, fwd_module=None, enabled=True)

A context that collects parameters that were partitioned via a deepspeed.zero.Init context. The parameters are partitioned again upon exit.

Parameters
  • params (torch.nn.Parameter) – A single parameter, or an iterable of parameters (list, tuple, generator) of parameters to collect. It’s assumed that all parameters are zero params.

  • modifier_rank (int, optional) – If specified, this rank’s parameter will be broadcasted on exit from the context. This argument is required if params are modified, so that all processes have a consistent view of the data. Defaults to None.

  • fwd_module (torch.nn.Module, optional) – If specified, params will be registered as external parameters of fwd_module. See deepspeed.zero.register_external_parameter().

  • enabled (bool, optional) – If False, this context is a no-op. Defaults to True.

Important: Make sure to use modifier_rank that is not None (e.g., modifier_rank=0) if you need the GPU memory allocated by gather to be released upon exit from the context manager.

Important: if params isn’t an iterable of parameters or a single parameter it’ll be silently ignored!

Examples

  1. Allocate a partitioned module, initialize its weight on rank 0, and update all processes.

    with deepspeed.zero.Init():
        linear = torch.nn.Linear(1000,1000)
    
    with deepspeed.zero.GatheredParameters(linear.weight,
                                           modifier_rank=0):
        if deepspeed.comm.get_rank() == 0:
            linear.weight.zero_()
    
    with deepspeed.zero.GatheredParameters(linear.weight,
                                           modifier_rank=0):
        if deepspeed.comm.get_rank() == 0:
            linear.weight.zero_()
    
  2. Collect a partitioned weight to pass to another module during training. The parameter will be registered as an external parameter and made available during the backward pass.

    def forward(self, input):
        x = self.layer1(input)
    
        # self.layer1.weight is required by self.layer2.forward
        with deepspeed.zero.GatheredParameters(self.layer1.weight,
                                               fwd_module=self):
            y = self.layer2(x, self.layer1.weight)
        return y
    
  3. Pretrained model loading

    with deepspeed.zero.Init():
        model = MyModel()
    
    state_dict = torch.load(model_path, map_location="cpu")
    
    def load(module: nn.Module, prefix=""):
        # because zero3 puts placeholders in model params, this context
        # manager gathers (unpartitions) the params of the current layer, then loads from
        # the state dict and then re-partitions them again
        with deepspeed.zero.GatheredParameters(list(module.parameters(recurse=False)), modifier_rank=0):
            if deepspeed.comm.get_rank() == 0:
                module._load_from_state_dict(state_dict, prefix)
    
        for name, child in module._modules.items():
            if child is not None:
                load(child, prefix + name + ".")
    
    load(model, prefix="")
    

If this approach is not used, then the full model will first be copied to each GPU. For models bigger than the memory of a single GPU, this method is required.

Registering External Parameters

ZeRO-3 will automatically collect and partition the model parameters as they are needed during the forward and backward passes. However, in some cases a parameter may be used outside of its module’s forward pass. We call these external parameters. ZeRO-3 can coordinate these parameters if they are registered either automatically or manually.

Note

DeepSpeed version 0.3.15 includes automatic external parameter discovery and registration to support the most common cases. Parameters can still be manually registered if they cannot be automatically detected.

DeepSpeed can automatically detect the following external parameter scenarios:

  1. Parameter access: consider the following pattern common in language models such as GPT:

    The tensor embeddings.weight is used in both embeddings.forward() and compute_logits(). We call embeddings.weight an external parameter because it is used in the training loop outside of its owning module’s forward pass.

    class LanguageModel(torch.nn.Module):
        ...
        def forward(self, inputs):
            embeds = self.embeddings(inputs)
            ...
            logits = compute_logits(output, self.embeddings.weight)
            ...
    
  2. Returning a parameter:

    CustomLinear returns both an output and its own bias parameter. DeepSpeed will detect the external bias parameter and register it with submodules that use CustomLinear.

    class CustomLinear(torch.nn.Linear):
        def forward(self, *input):
            output = super().forward(*input)
            return output, self.bias
    
deepspeed.zero.register_external_parameter(module, parameter)

Instruct DeepSpeed to coordinate parameter’s collection and partitioning in the forward and backward passes of module.

This is used when a parameter is accessed outside of its owning module’s forward(). DeepSpeed must know to collect it from its partitioned state and when to release the memory.

Note

This is only applicable to training with ZeRO stage 3.

Parameters
  • module (torch.nn.Module) – The module that requires parameter in its forward pass.

  • parameter (torch.nn.Parameter) – The parameter to register.

Raises

RuntimeError – If parameter is not of type torch.nn.Parameter.

Examples

  1. Register a weight that is used in another module’s forward pass (line 6). Parameter layer1.weight is used by layer2 (line 11).

     1class ModuleZ3(torch.nn.Module):
     2    def __init__(self, *args):
     3        super().__init__(self, *args)
     4        self.layer1 = SomeLayer()
     5        self.layer2 = OtherLayer()
     6        deepspeed.zero.register_external_parameter(self, self.layer1.weight)
     7
     8    def forward(self, input):
     9        x = self.layer1(input)
    10        # self.layer1.weight is required by self.layer2.forward
    11        y = self.layer2(x, self.layer1.weight)
    12        return y
    

Overriding Module.apply

A convenient mechanism for customizing model initialization is Module.apply. With ZeRO stage 3, Module.apply implementations must account for parameter partitioning by zero.Init during model initialization. The default behavior of ZeRO stage 3 is to automatically handle this issue by overriding Module.apply to ensure that parameters are gathered before access by Module.apply. The benefit of this approach is development convenience, since users are saved the burden of manual parameter coordination in Module.apply. However, the downside is slow model initialization, since all the model parameters (e.g., billions) are gathered even though the common usage of Module.apply is to customize a few parameters. Developers can disable this default behavior by setting the override_module_apply configuration knob to False, for faster model initialization at the cost of manually handling partitioned parameters in their Module.apply implementations.

Memory-Centric Tiling

To reduce the working memory requirements of DL training for large models, ZeRO-Infinity includes technique called memory-centric tiling that exploits the data fetch and release pattern of ZeRO-3 to reduce the working memory requirements by breaking down a large operator into smaller tiles that can be executed sequentially. When combined with ZeRO-3, the parameter and gradients of each tile can be fetched and released one at a time, reducing the working memory proportional to the number of tiles. Therefore, ZeRO-Infinity can support operators of arbitrary sizes, without refactoring for model parallelism to fit them in limited GPU memory.

class deepspeed.zero.TiledLinear(in_features, out_features, bias=True, in_splits=1, out_splits=1, input_is_already_split=False, combine_out_splits=True, linear_cls=<class 'torch.nn.modules.linear.Linear'>, init_linear=None, **kwargs)

A replacement for torch.nn.Linear that works with ZeRO-3 to reduce memory requirements via tiling.

TiledLinear breaks the input and output dimensions of a linear layer into tiles that are processed in sequence. This class enables huge linear layers when combined with ZeRO-3 because inactive tiles can be partitioned and offloaded.

Note

We recommend using as few tiles as necessary. Tiling significantly reduces memory usage, but can reduce throughput for inexpensive layers. This due to the smaller kernels having less parallelism and lower arithmetic intensity, while introducing more frequent synchronization and communication.

Parameters
  • in_features (int) – See torch.nn.Linear

  • out_features (int) – See torch.nn.Linear

  • bias (bool, optional) – See torch.nn.Linear

  • in_splits (int, optional) – The number of tiles along the input dimension. Defaults to 1.

  • out_splits (int, optional) – The number of tiles along the output dimension. Defaults to 1.

  • input_is_already_split (bool, optional) – If set to True, assume that the input_ in to forward() is already split into in_splits chunks. Defaults to False.

  • combine_out_splits (bool, optional) – If set to False, do not combine the out_splits outputs into a single tensor. Defaults to True.

  • linear_cls (class, optional) – The underlying class to build individual tiles. Defaults to torch.nn.Linear.

  • init_linear (torch.nn.Linear, optional) – If set, copy the parameters of init_linear. Useful for debugging. Defaults to None.

  • kwargs (dict, optional) – additional keyword arguments to provide to linear_cls().

Raises
  • RuntimeErrorin_splits must be within the range [1, in_features).

  • RuntimeErrorout_splits must be within the range of [1, out_features).

forward(input_)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

copy_params_from(other)

Copy the weight and bias data from other.

This is especially useful for reproducible initialization and testing.

Equivalent to:

with torch.no_grad():
    self.weight.copy_(other.weight)
    if self.bias is not None:
        self.bias.copy_(other.bias)

Note

If ZeRO-3 is enabled, this is a collective operation and the updated parameters of data-parallel rank 0 will be visible on all ranks. See deepspeed.zero.GatheredParameters for more information.

Parameters

other (torch.nn.Linear) – the linear layer to copy from.

Debugging

Debugging ZeRO training is complicated by the partitioning of parameters, gradients, and optimizer states. None of these 3 groups of tensors (model states) can be normally accessed because of that. To overcome that DeepSpeed provides the following routines for accessing individual model states in both their partitioned (local) and unpartitioned (full) forms.

Important notes:

# These APIs return tensors that are on accelerator device even if the corresponding model state is offloaded to CPU or NVMe.

# To access the unpartitioned (full) form, these utilities must be called by all processes participating in the training, even if you decide to do something with the result only in the main process. If all processes don’t participate these utilities will hang waiting for all processes to send their contribution.

# You must be aware that these routines return correct data only in specific phases of the training. So for examples the gradients are valid after backward and before step. The optimizer states are updated after step. Same goes for fp32 master weights.

deepspeed.utils.safe_get_full_fp32_param(param)[source]

Assemble and return the fp32 parameter of a low-precision (e.g., fp16) parameter.

Parameters

param (torch.nn.Parameter) – A model parameter

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

deepspeed.utils.safe_get_full_grad(param)[source]

Assemble and return the fp32 gradient of a low-precision (e.g., fp16) parameter. The return data type is that used for gradient accumulation. This is usually the param data type, but could also be different (e.g., bf16 param training with fp32 gradient accumulation).

Parameters

param (torch.nn.Parameter) – A model parameter

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

deepspeed.utils.safe_get_full_optimizer_state(param, optim_state_key)[source]

Assemble and return the fp32 optimizer state of a low-precision (e.g., fp16) parameter.

Parameters
  • param (torch.nn.Parameter) – A model parameter

  • optim_state_key (string) – Key value of optimizer state (e.g., exp_avg in Adam optimizer)

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

deepspeed.utils.safe_get_local_fp32_param(param)[source]

Get the local partition of a ZeRO-3 partitioned parameter in fp32 precision.

Parameters

param (torch.nn.Parameter) – A model parameter.

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

deepspeed.utils.safe_get_local_grad(param)[source]

Get the local gradient partition of a ZeRO-3 partitioned parameter. The return data type is that used for gradient accumulation. This is usually the param data type, but could also be different (e.g., bf16 param training with fp32 gradient accumulation).

Parameters

param (torch.nn.Parameter) – A model parameter

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

deepspeed.utils.safe_get_local_optimizer_state(param, optim_state_key)[source]

Get the local optimizer state partition of ZeRO-3 partitioned parameter in fp32 precision.

Parameters
  • param (torch.nn.Parameter) – A model parameter

  • optim_state_key (string) – Key value of optimizer state (e.g., exp_avg in Adam optimizer)

Returns

A tensor on accelerator device

Return type

Union[torch.Tensor, None]

These routines can be used in a training loop as shown in the following snippet.

backward(loss)
[...]
from deepspeed.utils import safe_get_full_fp32_param, safe_get_full_grad, safe_get_full_optimizer_state
for n, lp in model.named_parameters():
    # 1. Access the full states
    #  1.1) gradient lookup
    # For zero1 and zero2, gradient lookup must be called after `backward` and before `step`
    # For zero3, gradient lookup must be called after `backward`
    hp_grad = safe_get_full_grad(lp)


    # 1.2) fp32 and optim states can probably be called anywhere in the training loop, but will be updated after `step`
    hp = safe_get_full_fp32_param(lp)
    exp_avg = safe_get_full_optimizer_state(lp, "exp_avg")
    exp_avg_sq = safe_get_full_optimizer_state(lp, "exp_avg_sq")

    # 2. Access the local states (zero3)
    # For zero3, all of the parameters, gradients, and optimizer states are partitioned,
    # and each process can access its corresponding local state.
    local_hp = safe_get_local_fp32_param(lp)
    local_hp_grad = safe_get_local_grad(lp)
    local_exp_avg = safe_get_local_optimizer_state(lp, "exp_avg")
    local_exp_avg_sq = safe_get_local_optimizer_state(lp, "exp_avg_sq")

[...]
optimizer.step()

Modifying Partitioned States

Sometimes, a user may want to modify parameters, gradients, or optimizer states outside of the regular training loop. This is currently difficult in ZeRO training because of partitioning. To overcome that, DeepSpeed provides the following routines for modifying the fp32 master parameters and the fp32 optimizer states.

deepspeed.utils.safe_set_full_fp32_param(param, value)[source]

Update the partitioned fp32 parameter of a low-precision (e.g., fp16) parameter.

Parameters
  • param (torch.nn.Parameter) – A model parameter

  • value (torch.Tensor) – New value

deepspeed.utils.safe_set_full_optimizer_state(param, value, optim_state_key)[source]

Update the partitioned fp32 optimizer state of a low-precision (e.g., fp16) parameter.

Parameters
  • param (torch.nn.Parameter) – A model parameter

  • value (torch.Tensor) – New value

  • optim_state_key (string) – Key value of optimizer state (e.g., exp_avg in Adam optimizer)

deepspeed.utils.safe_set_full_grad(param, value)[source]

Update the partitioned gradient of a low-precision (e.g., fp16) parameter. To avoid precision issues, the update value should have the data type of gradient accumulation.

Parameters
  • param (torch.nn.Parameter) – A model parameter

  • value (torch.Tensor) – The un-partitioned new gradient value.

deepspeed.utils.safe_set_local_fp32_param(param, value)[source]

Update the local partition of ZeRO-3 partitioned parameter.

Parameters
  • param (torch.nn.Parameter) – A model parameter.

  • value (torch.Tensor) – New value of local parameter partition.

deepspeed.utils.safe_set_local_grad(param, value)[source]

Update the local gradient partition of a ZeRO-3 partitioned parameter. To avoid precision issues, the update value should have the data type of gradient accumulation.

Parameters
  • param (torch.nn.Parameter) – A model parameter.

  • value (torch.Tensor) – New value of local gradient partition.

deepspeed.utils.safe_set_local_optimizer_state(param, value, optim_state_key)[source]

Update the local optimizer state partition of a ZeRO-3 partitioned parameter.

Parameters
  • param (torch.nn.Parameter) – A model parameter.

  • value (torch.Tensor) – New value of local optimizer state partition.

  • optim_state_key (string) – Key value of optimizer state (e.g., exp_avg in Adam optimizer).

deepspeed.utils.safe_update_full_grad_vectorized(param_list: List[Parameter], update_func: Callable)[source]

Vectorized update of the partitioned gradients of a list of low-precision (e.g., fp16) parameters. To avoid precision issues, the update value should have the data type of gradient accumulation.

Parameters
  • param_list (List[torch.nn.Parameter]) – List of model parameters

  • update_func (torch.Tensor) – A function that takes current full gradient value and returns new one.

The routines for modifying parameters and optimizer states can be used at any point after initialization of the DeepSpeed engine (i.e., deepspeed.initialize()) as shown in the following snippet.

[...]
from deepspeed.runtime.zero.utils import is_zero_param
from deepspeed.utils import safe_set_full_fp32_param, safe_set_full_optimizer_state
from deepspeed.utils import safe_set_local_fp32_param, safe_set_local_optimizer_state
# Here is an example to zero all the fp32 parameters and optimizer states.
for n, lp in model.named_parameters():
    # 1. For zero stage 1, 2, or 3 set the full fp32 and their full optim states
    zero_tensor = torch.zeros(lp.ds_shape) if is_zero_param(lp) else torch.zeros(lp.shape)

    safe_set_full_fp32_param(lp, zero_tensor)
    safe_get_full_optimizer_state(lp, zero_tensor, "exp_avg")
    safe_get_full_optimizer_state(lp, zero_tensor, "exp_avg_sq")

    # 2. For zero stage 3, each process sets its local fp32 parameters and their local optimizer states individually
    zero_tensor_local = torch.zeros(lp.ds_tensor.shape)

    safe_set_local_fp32_param(lp, zero_tensor_local)
    safe_set_local_optimizer_state(lp, zero_tensor_local, "exp_avg")
    safe_set_local_optimizer_state(lp, zero_tensor_local, "exp_avg_sq")

[...]

The routines for modifying gradients can be used after backward but before step as shown in the following snippet.

backward(loss)
[...]
from deepspeed.runtime.zero.utils import is_zero_param
from deepspeed.utils import safe_set_full_grad, safe_set_local_grad
# Here is an example of how to zero all the gradients.
for n, lp in model.named_parameters():
    # 1. For zero stage 1, 2, or 3 set the full gradient.
    zero_tensor = torch.zeros(lp.ds_shape) if is_zero_param(lp) else torch.zeros(lp.shape)

    safe_set_full_grad(lp, zero_tensor)

    # 2. For zero stage 3, each process sets its local gradient partition.
    zero_tensor_local = torch.zeros_like(lp.ds_tensor.shape)

    safe_set_local_grad(lp, zero_tensor_local)

[...]
optimizer.step()

GPU Memory Management

By default at the end of training with ZeRO stage 3 some parameters could remain unpartitioned and use up some gpu memory. This is done on purpose as an optimization should you resume training again. If you’d like to clear out the cached parameters that use up gpu memory, you can call empty_partition_cache method of a DeepSpeed engine.

The following code snippet illustrates this functionality.

with zero.Init():
    model = MyLargeModel()

ds_engine, _, _, _ = deepspeed.initialize(model, ...)
for batch in ...:
    loss = ds_engine(batch)
    ds_engine.backward(batch)
    ds_engine.step()

# Free GPU memory consumed by model parameters
ds_engine.empty_partition_cache()

Offload States

The DeepSpeed engine maintains a set of states in device memory (e.g., CUDA memory). The following API allows you to offload these states to a different device (currently, only CPU memory is supported), reducing the memory footprint on the device.

def offload_states(self,
                   include: Container[OffloadStateTypeEnum] = None,
                   device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                   pin_memory: bool = True,
                   non_blocking: bool = False) -> None:
    """Offload the engine's states to the specified device.

    Arguments:
        include: Optional. The set of states to offload. If not provided, all states are offloaded.
        device: Optional. The device to move the ZeRO optimizer buffers to. Currently only `OffloadDeviceEnum.cpu` is supported.
        pin_memory: Optional. Whether to pin the memory of the offloaded states.
        non_blocking: Optional. Whether to offload the states asynchronously.
    """

You can selectively offload specific states by specifying the OffloadStateTypeEnum in the include argument. OffloadStateTypeEnum is an enum that defines the states that can be offloaded. The following states are supported:

  • OffloadStateTypeEnum.optim_states: Optimizer states. Currently, only states of DeepSpeed’s FusedAdam optimizer are supported.

  • OffloadStateTypeEnum.hp_params: FP32 parameters.

  • OffloadStateTypeEnum.lp_params: BF16/FP16 parameters.

  • OffloadStateTypeEnum.lp_grads: BF16/FP16 gradients.

  • OffloadStateTypeEnum.contiguous_grad_buffer: The contiguous gradient buffer for reduce operations.

Note that offloading states comes with a trade-off between memory savings and computational overhead. This API allows states to be reloaded back into device memory when needed.

def reload_states(self, non_blocking: bool = False) -> None:
    """Reload the engine states to the original device.

    Arguments:
        non_blocking: Optional. Whether to offload the states asynchronously.
    """

Below is an example code snippet demonstrating how to offload FP32 parameters and optimizer states to CPU memory:

# Offload after forward, backward, and step
ds_engine.offload_states(include=[OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.optim_states])

# Do something requiring a lot of device memory
...
# Load states back to device memory
ds_engine.reload_states()

deepspeed.runtime.zero.offload_states.get_state_devices returns devices of the specified state.

def get_state_devices(model, state: OffloadStateTypeEnum) -> Set[torch.device]:
    """Retrieve the devices of the specified state of the model.

    Args:
        model (DeepSpeedEngine): The model whose device allocations are to be checked.
        state (OffloadStateTypeEnum): The specific state for which the devices should be retrieved.

    Returns:
        Set[torch.device]: A set of devices of the specified state.

    """