# Big Modeling with Sequential Onloading

## What is Sequential Onloading?
Sequential onloading is a memory-efficient approach for compressing large language models (LLMs) using only a single GPU. Instead of loading the entire model into memory—which can easily require hundreds of gigabytes—this method loads and compresses one layer at a time. The outputs are offloaded before the next layer is processed, dramatically reducing peak memory usage while maintaining high compression fidelity.
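Conceptually, the loop looks like the sketch below. This is an illustrative outline, not LLM Compressor's actual implementation: `compress_sequentially` and `quantize_layer` are hypothetical names, and the real pipeline additionally manages hooks, activation caching, and offloading of calibration data.

```python
import torch

def quantize_layer(layer: torch.nn.Module) -> None:
    # Stand-in for a real compression step (e.g. GPTQ); illustrative only.
    for param in layer.parameters():
        param.data = param.data.half().float()

def compress_sequentially(layers, batches, device="cuda"):
    """Onload, calibrate, compress, and offload one layer at a time."""
    for layer in layers:
        layer.to(device)  # onload a single layer onto the GPU
        with torch.no_grad():
            # Run calibration data through the layer; keep results on the CPU
            # so they can serve as inputs to the next layer.
            batches = [layer(x.to(device)).cpu() for x in batches]
        quantize_layer(layer)  # compress while the layer is on the GPU
        layer.to("cpu")  # offload before the next layer is processed
    return batches
```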
For more information, see the Red Hat AI blog post or the LLM Compressor Office Hours recording.
## Using Sequential Onloading
Sequential onloading is enabled by default within LLM Compressor. To disable it, add the `pipeline="basic"` argument to the LLM Compressor `oneshot` function call.
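For instance, a minimal sketch (assuming `model`, `ds`, and `recipe` are already defined, as in the walkthrough below):

```python
from llmcompressor import oneshot

# pipeline="basic" disables sequential onloading, so the entire model
# must fit into available memory during calibration.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    pipeline="basic",
)
```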
## Running Llama 3.3 70b
Llama 3.3 70b is larger than 80 GB, exceeding the memory capacity of a single A100. With sequential onloading, however, this model can still be quantized seamlessly using a single GPU.
### Code Walkthrough
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map=None)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
The model is first loaded onto the CPU, as indicated through the use of `None` for the `device_map` argument in the `from_pretrained` method when loading the model.
```python
from llmcompressor import oneshot

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```
When running `oneshot`, only one GPU is required; it is used to onload each layer for calibration in a sequential manner.

```python
from llmcompressor.utils import dispatch_for_generation

# Spread the quantized model across the available devices for inference.
dispatch_for_generation(model)

sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
Finally, we call `dispatch_for_generation` to evenly load the model across available devices (potentially offloading the model if required) and run sample generations on the newly quantized model.