Training API

deepspeed.initialize() returns a training engine of type DeepSpeedEngine as the first element of its return value. This engine is used to drive training:

for step, batch in enumerate(data_loader):
    # Forward pass: calling the engine invokes its forward() method
    loss = model_engine(batch)

    # Backward pass: runs backpropagation on the loss
    model_engine.backward(loss)

    # Weight update
    model_engine.step()
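
The call order in the loop above can be checked without a GPU or a DeepSpeed installation. The following is a minimal sketch, not DeepSpeed code: RecordingEngine is a hypothetical stand-in that only logs which methods are called, and train_one_epoch is an illustrative helper name. Any object exposing the same three entry points (call, backward, step) could be passed in its place.

```python
def train_one_epoch(model_engine, data_loader):
    """Run forward, backward, and step once per batch; return the losses."""
    losses = []
    for step, batch in enumerate(data_loader):
        loss = model_engine(batch)       # forward pass
        model_engine.backward(loss)      # backpropagation
        model_engine.step()              # weight update
        losses.append(loss)
    return losses


class RecordingEngine:
    """Hypothetical stand-in for DeepSpeedEngine that only logs calls."""

    def __init__(self):
        self.calls = []

    def __call__(self, batch):
        self.calls.append("forward")
        return float(sum(batch))  # dummy "loss" so the example has a value

    def backward(self, loss):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")


engine = RecordingEngine()
losses = train_one_epoch(engine, [[1, 2], [3, 4]])
print(engine.calls)
```

Each batch produces exactly one forward, one backward, and one step call, in that order.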

Forward Propagation

deepspeed.DeepSpeedEngine.forward(*args, **kwargs)

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass must be defined within this function, call the Module instance itself rather than forward() directly: calling the instance runs the registered hooks, whereas calling forward() directly silently skips them.

Backward Propagation

deepspeed.DeepSpeedEngine.backward(*args, **kwargs)

Optimizer Step

deepspeed.DeepSpeedEngine.step(self, lr_kwargs=None)

Execute the weight update step after forward and backward propagation on effective_train_batch.

Gradient Accumulation

deepspeed.DeepSpeedEngine.is_gradient_accumulation_boundary(self)

Query whether the current micro-batch is at the boundary of gradient accumulation, and thus will trigger gradient reductions and an optimizer step.

Returns: True if the current step is a gradient accumulation boundary, False otherwise.
Return type: bool
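
The boundary condition described above can be illustrated with plain arithmetic. This is a sketch under stated assumptions, not the engine's implementation: it assumes micro-batches are counted from zero and that a boundary falls on every gradient_accumulation_steps-th micro-batch. The function name is_accumulation_boundary and its parameters are hypothetical.

```python
def is_accumulation_boundary(micro_step: int, gradient_accumulation_steps: int) -> bool:
    """Return True when the micro-batch at zero-based index `micro_step`
    completes an accumulation window (assumed counting scheme)."""
    return (micro_step + 1) % gradient_accumulation_steps == 0


# With 4 accumulation steps, micro-batches 3 and 7 close a window.
boundaries = [is_accumulation_boundary(i, 4) for i in range(8)]
print(boundaries)
# [False, False, False, True, False, False, False, True]
```

Under these assumptions, gradient reduction and the optimizer step happen only on the micro-batches marked True; the others only accumulate gradients.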

Model Saving

deepspeed.DeepSpeedEngine.save_16bit_model(self, save_dir, save_filename='pytorch_model.bin', exclude_frozen_parameters=False)

Save 16bit model weights

This method saves the 16bit model weights at the desired destination.

Parameters

save_dir – Required. Directory for saving the model
save_filename – Optional. Filename to save to. Defaults to pytorch_model.bin
exclude_frozen_parameters – Optional. Exclude frozen parameters from checkpointed state.

Returns

True when a model has been saved, False otherwise. The model is not saved if stage3_gather_16bit_weights_on_model_save is False.

Important: all processes must call this method, not just the process with rank 0, because the processes need to work in sync to gather the weights. If the method is called only on the process with rank 0, it will hang waiting to synchronize with the other processes.

Additionally, when a DeepSpeed checkpoint is created, a script named zero_to_fp32.py is added to the checkpoint directory. It can be used to reconstruct the fp32 master weights into a single PyTorch state_dict file.
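
The two requirements above (the gather flag must be enabled, and every rank must make the call) can be sketched as follows. This is an illustrative fragment, not a complete configuration: ds_config shows only the zero_optimization section and omits every other setting a real run needs, and export_16bit_weights is a hypothetical helper name.

```python
# Partial DeepSpeed config (assumption: ZeRO stage 3 is in use).
# All other required settings are omitted here.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Without this flag, save_16bit_model() returns False and saves nothing.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}


def export_16bit_weights(model_engine, save_dir):
    """Save 16-bit weights; returns True if a file was written.

    Deliberately has no `if rank == 0:` guard: every process must make
    this call so the weights can be gathered collectively.
    """
    return model_engine.save_16bit_model(save_dir)
```

The return value is worth checking, since False indicates that nothing was written to save_dir.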

Training Multiple Models

DeepSpeed supports training multiple models, which is a useful feature in scenarios such as knowledge distillation and post-training RLHF. The core approach is to create individual DeepSpeedEngines for each model.

Training Independent Models

The following code snippet illustrates independently training multiple models on the same dataset.

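model_engines = [engine for engine, _, _, _ in [deepspeed.initialize(m, ...,) for m in models]]
for batch in data_loader:
    # Each engine runs its own forward pass on the same batch
    losses = [engine(batch) for engine in model_engines]
    for engine, loss in zip(model_engines, losses):
        engine.backward(loss)
        # Weight update, as in the single-model training loop
        engine.step()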

The above is similar to typical DeepSpeed usage except for the creation of multiple DeepSpeedEngines (one for each model).

Jointly Training Models With Shared Loss

The following code snippet illustrates jointly training multiple models on a shared loss value.

model_engines = [engine for engine, _, _, _ in [deepspeed.initialize(m, ...,) for m in models]]
for batch in data_loader:
    losses = [engine(batch[0], batch[1]) for engine in model_engines]
    loss = sum(l / (i + 1) for i, l in enumerate(losses))
    loss.backward()

    for engine in model_engines:
        engine._backward_epilogue()

    for engine in model_engines:
        engine.step()

    for engine in model_engines:
        engine.optimizer.zero_grad()

Besides the use of multiple DeepSpeedEngines, the above differs from typical usage in two key ways:

The backward call is made on the common loss value rather than on the individual model engines.
_backward_epilogue is called on each model engine after loss.backward().
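
The shared loss in the snippet above weights each model's loss by its position: the loss at zero-based index i is divided by i + 1 before summing. The sketch below reproduces only that arithmetic with plain floats so the weighting is easy to verify by hand; combine_losses is a hypothetical helper name, and in real training the inputs would be loss tensors returned by the engines.

```python
def combine_losses(losses):
    """Weight the i-th loss (zero-based) by 1 / (i + 1) and sum the results."""
    return sum(l / (i + 1) for i, l in enumerate(losses))


# 1.0 / 1 + 2.0 / 2 + 3.0 / 3
shared = combine_losses([1.0, 2.0, 3.0])
print(shared)  # 3.0
```

This particular weighting is just the one used in the example; any differentiable combination of the per-model losses could take its place.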