# Quantization tools

## Quantizing data

Concrete-ML supports quantized ML models and provides quantization tools for Quantization Aware Training and Post-Training Quantization. The core of this functionality is the conversion of floating point values to integers and back, which is done using `QuantizedArray` in `concrete.ml.quantization`.

The `QuantizedArray` class takes several arguments that determine how float values are quantized:

- `n_bits` defines the precision of the quantization
- `values` are the floating point values that will be converted to integers
- `is_signed` determines if the quantized integer values should allow negative values
- `is_symmetric` determines if the range of floating point values to be quantized should be taken as symmetric around zero
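To make the role of these arguments concrete, here is a minimal pure-Python sketch of uniform (affine) quantization. It is not the `QuantizedArray` implementation: the helper names `quantization_params`, `quantize` and `dequantize` are hypothetical, and the exact formulas the library uses for the scale and zero-point may differ.

```python
def quantization_params(values, n_bits, is_signed=False, is_symmetric=False):
    """Derive (scale, zero_point, q_min, q_max) from the observed float range."""
    v_min, v_max = min(values), max(values)
    if is_symmetric:
        # Treat the float range as symmetric around zero
        bound = max(abs(v_min), abs(v_max))
        v_min, v_max = -bound, bound
    if is_signed:
        q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    else:
        q_min, q_max = 0, 2 ** n_bits - 1
    # One integer step corresponds to `scale` in float units
    scale = (v_max - v_min) / (q_max - q_min) if v_max > v_min else 1.0
    # Integer that represents the float value 0.0
    zero_point = round(q_min - v_min / scale)
    return scale, zero_point, q_min, q_max


def quantize(values, scale, zero_point, q_min, q_max):
    """Map floats to integers, clamping to the representable range."""
    return [min(q_max, max(q_min, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to (approximate) floats."""
    return [scale * (q - zero_point) for q in q_values]


values = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point, q_min, q_max = quantization_params(values, n_bits=8)
q_values = quantize(values, scale, zero_point, q_min, q_max)
recovered = dequantize(q_values, scale, zero_point)
```

With 8 unsigned bits, the smallest input maps to 0 and the largest to 255, and each recovered value differs from its original by at most about one quantization step.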

See also the `UniformQuantizer` reference for more information.

It is also possible to use symmetric quantization, where the integer values are centered around 0.
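The symmetric case can be sketched in a few lines of plain Python. This is an illustration under simple assumptions (signed integers, a zero-point fixed at 0, and a range clamped to `[-q_max, q_max]`), not the library's code; `quantize_symmetric` is a hypothetical helper.

```python
def quantize_symmetric(values, n_bits):
    """Symmetric signed quantization: the float 0.0 always maps to the integer 0.

    Assumes n_bits >= 2 so that at least one positive level exists.
    """
    q_max = 2 ** (n_bits - 1) - 1
    bound = max(abs(v) for v in values)
    scale = bound / q_max if bound > 0 else 1.0
    q_values = [min(q_max, max(-q_max, round(v / scale))) for v in values]
    return q_values, scale


# With 3 bits the integers lie in [-3, 3]
q_values, scale = quantize_symmetric([-2.0, -0.5, 0.0, 1.2], n_bits=3)
print(q_values)  # [-3, -1, 0, 2]
```

Because the zero-point is 0, de-quantization reduces to a single multiplication by `scale`.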

When de-quantizing model outputs, the `QuantizedArray` class is used in a different way: it is constructed from pre-quantized integer values, with the `scale` and `zero_point` set explicitly. Once the `QuantizedArray` is constructed, calling `dequant()` computes the floating point values corresponding to the integer values `qvalues`, which are the output of the `forward_fhe.encrypt_run_decrypt(..)` call.
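The arithmetic behind this de-quantization step can be sketched as follows. The integer outputs, scale and zero-point below are made-up values, and `dequantize_outputs` is a hypothetical stand-in for what `dequant()` computes, assuming the usual affine mapping `scale * (q - zero_point)`.

```python
def dequantize_outputs(q_values, scale, zero_point):
    """Recover floats from integers using explicitly provided parameters."""
    return [scale * (q - zero_point) for q in q_values]


# Made-up integers standing in for the decrypted output of an FHE run
q_out = [0, 10, 20]
floats = dequantize_outputs(q_out, scale=0.5, zero_point=10)
print(floats)  # [-5.0, 0.0, 5.0]
```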

## Quantized modules

Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
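As an illustration of this re-scaling, consider a dot product, the building block of linear layers. The sketch below assumes the affine mapping `float = scale * (q - zero_point)` for inputs, weights and outputs; `quantized_dot` and all parameter values are hypothetical and do not reproduce the library's quantized operators.

```python
def quantized_dot(q_x, q_w, s_x, z_x, s_w, z_w, s_y, z_y):
    """Dot product on quantized operands, re-quantized to the output domain."""
    # Integer-only accumulation on zero-point-corrected values
    acc = sum((qx - z_x) * (qw - z_w) for qx, qw in zip(q_x, q_w))
    # acc * s_x * s_w is the float result; dividing by s_y and adding z_y
    # moves it into the output's quantization domain
    return round(acc * (s_x * s_w / s_y)) + z_y


# Inputs x = [1.0, 2.0] with scale 0.1, weights w = [0.5, -0.5] with scale 0.5
q_y = quantized_dot(
    q_x=[10, 20], q_w=[5, 3],
    s_x=0.1, z_x=0, s_w=0.5, z_w=4,
    s_y=0.25, z_y=8,
)
y = 0.25 * (q_y - 8)  # de-quantized output
print(q_y, y)  # 6 -0.5
```

The float computation `1.0 * 0.5 + 2.0 * (-0.5)` also gives `-0.5`, so in this small case the quantized path reproduces the float result exactly; in general it only approximates it.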

In Concrete-ML, the quantized equivalent of a scikit-learn model or a PyTorch `nn.Module` is the `QuantizedModule`. Note that only inference is implemented in the `QuantizedModule`, and it is built through a conversion of the inference function of the corresponding scikit-learn or PyTorch module.

Built-in neural networks expose the `quantized_module` member. A `QuantizedModule` is also the result of compiling custom models through `compile_torch_model` and `compile_brevitas_qat_model`.

The quantized versions of floating point model operations are stored in the `QuantizedModule`. The `ONNX_OPS_TO_QUANTIZED_IMPL` dictionary maps ONNX floating point operators (e.g. `Gemm`) to their quantized equivalent (e.g. `QuantizedGemm`). For more information on implementing these operations, please see the FHE compatible op-graph section.

The computation graph is taken from the corresponding floating point ONNX graph exported from scikit-learn using Hummingbird, or from the ONNX graph exported by PyTorch. Calibration is used to obtain quantized parameters for the operations in the `QuantizedModule`. Parameters are also determined for the quantization of inputs during model deployment.

Calibration is the process of determining the typical distribution of the intermediate values that a model produces during inference.

To perform calibration, an interpreter goes through the ONNX graph in topological order and stores the intermediate results as it goes. The statistics of these values determine quantization parameters.
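The calibration loop described above can be sketched on a toy graph. Everything here is an assumption made for illustration: the graph is a plain list of `(name, function, input_names)` tuples already in topological order, values are scalars, and the statistics are simply the minimum and maximum seen per node. The real interpreter works on ONNX graphs and tensors.

```python
def calibrate(graph, calibration_inputs, n_bits):
    """Run float inference on calibration data and derive per-node parameters."""
    stats = {}
    for x in calibration_inputs:
        results = {"input": x}
        # Visit nodes in topological order, storing every intermediate result
        for name, fn, input_names in graph:
            results[name] = fn(*(results[i] for i in input_names))
        # Update the running min/max for each stored value
        for name, value in results.items():
            low, high = stats.get(name, (value, value))
            stats[name] = (min(low, value), max(high, value))
    params = {}
    for name, (low, high) in stats.items():
        scale = (high - low) / (2 ** n_bits - 1) if high > low else 1.0
        zero_point = round(-low / scale)  # unsigned range starting at 0
        params[name] = (scale, zero_point)
    return params


# Toy graph: a linear operation followed by a ReLU
toy_graph = [
    ("linear", lambda x: 2.0 * x + 1.0, ["input"]),
    ("relu", lambda v: max(v, 0.0), ["linear"]),
]
params = calibrate(toy_graph, calibration_inputs=[-1.0, 0.0, 1.5], n_bits=8)
```

Here the `linear` node sees values in `[-1, 4]` and the `relu` node values in `[0, 4]`, so the two nodes end up with different scales and zero-points.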

The `QuantizedModule` generates the Concrete-Numpy function that is compiled to FHE. The compilation will succeed if the intermediate values conform to the 8-bit precision limit of the Concrete stack. See the compilation section for details.
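A back-of-the-envelope estimate shows why this limit constrains the choice of `n_bits`. The sketch below computes the worst-case bit-width of an unsigned dot-product accumulator. `max_accumulator_bits` is a hypothetical helper; the compiler determines the actual bit-widths itself, so this is only a rough guide.

```python
def max_accumulator_bits(n_bits_input, n_bits_weight, n_terms):
    """Bits needed to hold the largest possible sum of unsigned products."""
    max_acc = n_terms * (2 ** n_bits_input - 1) * (2 ** n_bits_weight - 1)
    return max_acc.bit_length()


print(max_accumulator_bits(2, 2, 10))  # 7: fits within an 8-bit limit
print(max_accumulator_bits(3, 3, 4))   # 8: just fits
print(max_accumulator_bits(4, 4, 4))   # 10: exceeds an 8-bit limit
```

Under these assumptions, even a short sum of 4-bit products can overflow 8 bits, which is why inputs and weights typically have to be quantized to very few bits.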

## Resources

Lei Mao's blog on quantization: Quantization for Neural Networks

Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
