Concrete ML has support for quantized ML models and also provides quantization tools for Quantization Aware Training and Post-Training Quantization. The core of this functionality is the conversion of floating point values to integers and back. This is done using QuantizedArray
in concrete.ml.quantization
.
The QuantizedArray
class takes several arguments that determine how float values are quantized:
n_bits
defines the precision used in quantization
values
are floating point values that will be converted to integers
is_signed
determines if the quantized integer values should allow negative values
is_symmetric
determines if the range of floating point values to be quantized should be taken as symmetric around zero
See also the UniformQuantizer reference for more information:
It is also possible to use symmetric quantization, where the integer values are centered around 0:
In the following example, showing the de-quantization of model outputs, the QuantizedArray
class is used in a different way. Here it uses pre-quantized integer values and has the scale
and zero-point
set explicitly. Once the QuantizedArray
is constructed, calling dequant()
will compute the floating point values corresponding to the integer values qvalues
, which are the output of the fhe_circuit.encrypt_run_decrypt(..)
call.
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions, and element-wise operations. When working with quantized values, these operations cannot be carried out in an equivalent way to floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
In Concrete ML, the quantized equivalent of a scikit-learn model or a PyTorch nn.Module
is the QuantizedModule
. Note that only inference is implemented in the QuantizedModule
, and it is built through a conversion of the inference function of the corresponding scikit-learn or PyTorch module.
Built-in neural networks expose the quantized_module
member, while a QuantizedModule
is also the result of the compilation of custom models through compile_torch_model
and compile_brevitas_qat_model
.
The quantized versions of floating point model operations are stored in the QuantizedModule
. The ONNX_OPS_TO_QUANTIZED_IMPL
dictionary maps ONNX floating point operators (e.g., Gemm) to their quantized equivalent (e.g., QuantizedGemm). For more information on implementing these operations, please see the FHE-compatible op-graph section.
The computation graph is taken from the corresponding floating point ONNX graph exported from scikit-learn using HummingBird, or from the ONNX graph exported by PyTorch. Calibration is used to obtain quantized parameters for the operations in the QuantizedModule
. Parameters are also determined for the quantization of inputs during model deployment.
Calibration is the process of determining the typical distributions of values encountered for the intermediate values of a model during inference.
To perform calibration, an interpreter goes through the ONNX graph in topological order and stores the intermediate results as it goes. The statistics of these values determine quantization parameters.
That QuantizedModule
generates the Concrete function that is compiled to FHE. The compilation will succeed if the intermediate values conform to the 16-bits precision limit of the Concrete stack. See the compilation section for details.
Lei Mao's blog on quantization: Quantization for Neural Networks
Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference