# Non-uniform Quantization
In certain cases, it may be useful to combine quantization schemes of different precisions and/or strategies to achieve better recovery. For example, in some decoder-only models, the `down_proj` layer has shown greater sensitivity to quantization, and performance can be improved by quantizing this layer to int8 or fp8 instead of int4 or fp4. The examples in this folder illustrate several cases of non-uniform quantization.
## Mixed-Precision Quantization
We demonstrate mixed precision by quantizing models to both int8 and int4, and, in a second example, to both fp4 (specifically, nvfp4) and fp8. In both cases, we use config groups to assign higher precision to the `down_proj` layer and lower precision to the remaining linear layers. For nvfp4 and fp8, we also apply two model compressors, `nvfp4-pack-quantized` and `float-quantized`. The resulting compressed model's `config.json` shows `mixed-precision` as the value for `format`, indicating that the model has been compressed using multiple formats. The specific format applied to each set of layers is specified under each config group's `format` key.
## Multiple Strategies
It may also be useful to quantize a model with two different quantization strategies, such as group, channel, or per-tensor. Here we apply fp8 quantization where all of the attention weights are quantized using the per-channel strategy and all of the MLP weights are quantized per-tensor. This is accomplished by defining multiple config groups in the recipe. The produced model is compressed using the `float-quantized` compressor and can be run directly in vLLM.
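As a rough sketch of what such a recipe could look like, the snippet below defines one fp8 config group for the attention projections (per-channel weights) and one for the MLP projections (per-tensor weights). The regex targets assume Llama-style module names and the model id is a placeholder; the actual scripts in this folder are the reference.

```python
# Illustrative sketch only (assumes Llama-style module names): fp8 weights with
# a per-channel strategy for attention projections and a per-tensor strategy
# for MLP projections, expressed as two config groups.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    ignore=["lm_head"],
    config_groups={
        # Attention weights: fp8, per-channel scales.
        "attention_fp8_channel": {
            "targets": ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj", "re:.*o_proj"],
            "weights": {
                "num_bits": 8,
                "type": "float",
                "strategy": "channel",
                "symmetric": True,
            },
        },
        # MLP weights: fp8, a single per-tensor scale.
        "mlp_fp8_tensor": {
            "targets": ["re:.*gate_proj", "re:.*up_proj", "re:.*down_proj"],
            "weights": {
                "num_bits": 8,
                "type": "float",
                "strategy": "tensor",
                "symmetric": True,
            },
        },
    },
)

# Placeholder model id; the saved checkpoint can then be loaded in vLLM.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-FP8-mixed-strategy",
)
```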