Quantization tools
Quantizing data
Concrete-ML has support for quantized ML models and also provides quantization tools for Quantization Aware Training and Post-Training Quantization. The core of this functionality is the conversion of floating point values to integers and back. This is done using QuantizedArray
in concrete.ml.quantization
.
The QuantizedArray
class takes several arguments that determine how float values are quantized:
n_bits
that defines the precision of the quantizationvalues
are floating point values that will be converted to integersis_signed
determines if the quantized integer values should allow negative valuesis_symmetric
determines if the range of floating point values to be quantized should be taken as symmetric around zero
See also the UniformQuantizer reference for more information:
It is also possible to use symmetric quantization, where the integer values are centered around 0:
In the following example, showing the de-quantization of model outputs, the QuantizedArray
class is used in a different way. Here it uses pre-quantized integer values and has the scale
and zero-point
set explicitly. Once the QuantizedArray
is constructed, calling dequant()
will compute the floating point values corresponding to the integer values qvalues
, which are the output of the forward_fhe.encrypt_run_decrypt(..)
call.
Quantized modules
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in an equivalent way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
In Concrete-ML, the quantized equivalent of a scikit-learn model or a PyTorch nn.Module
is the QuantizedModule
. Note that only inference is implemented in the QuantizedModule
, and it is built through a conversion of the inference function of the corresponding scikit-learn or PyTorch module.
Built-in neural networks expose the quantized_module
member, while a QuantizedModule
is also the result of the compilation of custom models through compile_torch_model
and compile_brevitas_qat_model
.
The quantized versions of floating point model operations are stored in the QuantizedModule
. The ONNX_OPS_TO_QUANTIZED_IMPL
dictionary maps ONNX floating point operators (e.g. Gemm) to their quantized equivalent (e.g. QuantizedGemm). For more information on implementing these operations, please see the FHE compatible op-graph section.
The computation graph is taken from the corresponding floating point ONNX graph exported from scikit-learn using HummingBird, or from the ONNX graph exported by PyTorch. Calibration is used to obtain quantized parameters for the operations in the QuantizedModule
. Parameters are also determined for the quantization of inputs during model deployment.
That QuantizedModule
generates the Concrete-Numpy function that is compiled to FHE. The compilation will succeed if the intermediate values conform to the 8-bits precision limit of the Concrete stack. See the compilation section for details.
Resources
Lei Mao's blog on quantization: Quantization for Neural Networks
Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Last updated
Was this helpful?