Quantization
Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).
This means that some accuracy in the representation is lost (e.g. a simple approach is to eliminate least-significant bits), but, in many cases in machine learning, it is possible to adapt the models to give meaningful results while using these smaller data types. This significantly reduces the number of bits necessary for intermediary results during the execution of these machine learning models.
Since FHE is currently limited to 8-bit integers, it is necessary to quantize models to make them compatible. As a general rule, the lower the precision a model uses, the better the FHE performance.
Let $[\alpha, \beta]$ be the range of our value to quantize, where $\alpha$ is the minimum and $\beta$ is the maximum. To quantize a range of floating point values (in $[\alpha, \beta]$) to integer values (in $[0, 2^{n} - 1]$), we first need to choose the data type that is going to be used. Concrete, the framework used by Concrete-ML, is currently limited to 8-bit integers, so this will be the value used in this example. Knowing the number of bits that can be used for a value in the range $[\alpha, \beta]$, we can compute the scale $S$ of the quantization:

$$S = \frac{\beta - \alpha}{2^{n} - 1}$$

where $n$ is the number of bits ($n \leq 8$). For the sake of example, let's take $n = 8$.
In practice, the quantization scale is then $S = \frac{\beta - \alpha}{255}$. This means the gap between consecutive representable values cannot be smaller than $S$, which, in turn, means there can be a substantial loss of precision. Every interval of length $S$ will be represented by a value within the range $[0..255]$.
The other important parameter from this quantization schema is the zero point $Z_p$. This essentially brings the 0 floating point value to a specific integer. If the quantization scheme is asymmetric (quantized values are not centered in 0), the resulting integer will be in $[0, 2^{n} - 1]$:

$$Z_p = \mathrm{round}\left(-\frac{\alpha}{S}\right)$$
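As a worked illustration, here is a minimal NumPy sketch (plain NumPy, not Concrete-ML code; the quantize/dequantize helper names are hypothetical) that applies these formulas:

```python
import numpy as np

def quantize(values, n_bits=8):
    """Quantize float values to unsigned integers using the scale / zero-point scheme above."""
    alpha, beta = values.min(), values.max()           # range [alpha, beta] of the data
    scale = (beta - alpha) / (2**n_bits - 1)           # S = (beta - alpha) / (2^n - 1)
    zero_point = int(round(-alpha / scale))            # Z_p = round(-alpha / S)
    qvalues = np.clip(np.round(values / scale) + zero_point, 0, 2**n_bits - 1)
    return qvalues.astype(np.int64), scale, zero_point

def dequantize(qvalues, scale, zero_point):
    """Map the integers back to (approximate) float values."""
    return (qvalues - zero_point) * scale

x = np.array([-1.5, 0.0, 0.7, 2.5])
q_x, scale, zero_point = quantize(x)
print(q_x)                                   # e.g. [  0  96 141 255]
print(dequantize(q_x, scale, zero_point))    # close to x, within one scale step
```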
When using quantized values in a matrix multiplication or convolution, the equations for computing the result become more complex. The IntelLabs distiller quantization documentation provides a more detailed explanation of the maths to quantize values and how to keep computations consistent.
Quantization implemented in Concrete-ML is done in two ways:
- The quantization is done automatically during the model's FHE compilation process. This approach requires little work by the user, but may not be a one-size-fits-all solution for all types of models. The final quantized model is FHE-friendly and ready to predict over encrypted data. This approach uses Post-Training Quantization (PTQ); see the sketch after this list.
- In some cases (when doing extreme quantization), PTQ is not sufficient to achieve a decent final model accuracy. Concrete-ML offers the possibility for the user to perform quantization before compilation to FHE, for example using Quantization-Aware Training (QAT). This can be done by any means, including by using third-party frameworks. In this approach, the user is responsible for implementing a fully integer model respecting the FHE constraints.
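As an example of the PTQ path, a floating point torch model can be compiled directly. This is only a minimal sketch, assuming compile_torch_model from concrete.ml.torch.compile is available; the model and calibration data are placeholders:

```python
import torch
from concrete.ml.torch.compile import compile_torch_model

# A small floating point model; quantization is applied automatically during compilation (PTQ).
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))

# Representative inputs used to calibrate the quantization ranges.
calibration_data = torch.randn(100, 4)

# n_bits controls the quantization precision; low values keep the circuit within FHE constraints.
quantized_module = compile_torch_model(model, calibration_data, n_bits=3)
```

With QAT, the network is quantized by the training framework itself, and compilation converts the already-integer model to an FHE circuit.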
Concrete-ML has support for quantized ML models as well as quantization tools. The core of this functionality is the conversion of floating point values to integers. This is done using QuantizedArray in concrete.ml.quantization.
The QuantizedArray class takes several arguments that determine how float values are quantized (see the API reference for more information):

- n_bits that defines the precision of the quantization
- values, the floating point values that will be converted to integers
- is_signed, which determines if the quantized integer values should allow negative values
- is_symmetric, which determines if the range of floating point values to be quantized should be taken as symmetric around zero
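A minimal usage sketch follows; the positional/keyword form of the arguments is an assumption and may differ slightly between Concrete-ML versions:

```python
import numpy as np
from concrete.ml.quantization import QuantizedArray

values = np.array([-1.5, 0.0, 0.7, 2.5])

# Unsigned, asymmetric 8-bit quantization of the float values.
q_array = QuantizedArray(8, values)

print(q_array.qvalues)    # integer representation of the values
print(q_array.dequant())  # floats reconstructed from the integers, close to the originals
```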
It is also possible to use symmetric quantization, where the integer values are centered around 0.
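Under the same assumptions about the constructor, a sketch of the symmetric variant:

```python
import numpy as np
from concrete.ml.quantization import QuantizedArray

values = np.array([-1.5, 0.0, 0.7, 2.5])

# Signed, symmetric 8-bit quantization: the zero point is 0 and the integer
# values are centered around it.
q_array = QuantizedArray(8, values, is_signed=True, is_symmetric=True)

print(q_array.qvalues)
print(q_array.dequant())
```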
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
The models implemented in Concrete-ML provide features to let the user quantize the input data and dequantize the output data.
Here is a simple example showing how to perform inference, starting from float values and ending up with float values. Note that the FHE engine that is compiled for the ML models does not support data batching.
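A sketch of this flow, reusing the quantized_module compiled earlier (the quantize_input, forward_fhe.encrypt_run_decrypt and dequantize_output names follow the rest of this page):

```python
import numpy as np

# A single sample (no batching), as float values.
x = np.array([[0.1, -0.3, 0.7, 1.2]])

# Quantize the input, run the encrypted inference, then dequantize the result.
q_x = quantized_module.quantize_input(x)
q_y = quantized_module.forward_fhe.encrypt_run_decrypt(q_x)
y = quantized_module.dequantize_output(q_y)

print(y)  # float predictions
```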
The functions quantize_input and dequantize_output make use of QuantizedArray described above. When the ML model quantized_module is calibrated, the min and max of the value distributions will be stored and applied to quantize/dequantize new data.
In the following example, QuantizedArray is used in a different way, using pre-quantized integer values and having the scale and zero-point set explicitly from calibration parameters. Once the QuantizedArray is constructed, calling dequant() will compute the floating point values corresponding to the integer values qvalues, which are the output of the forward_fhe.encrypt_run_decrypt(..) call.
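A sketch of this usage: the value_is_float, scale and zero_point keyword arguments are assumptions about the constructor and may not match every Concrete-ML version, while out_fhe, output_scale and output_zero_point are placeholders for the FHE output and the calibration parameters:

```python
from concrete.ml.quantization import QuantizedArray

# out_fhe holds the integer qvalues returned by forward_fhe.encrypt_run_decrypt(..);
# output_scale and output_zero_point come from calibration of the output quantizer.
q_out = QuantizedArray(
    8,                     # n_bits
    out_fhe,               # pre-quantized integer values
    value_is_float=False,  # assumed keyword: the values are already integers
    scale=output_scale,
    zero_point=output_zero_point,
)

y = q_out.dequant()  # floating point values corresponding to the integer qvalues
```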
Intermediary values computed during model inference might need to be re-scaled into the quantized domain of a subsequent model operator. For example, the output of a convolution layer in a neural network might have values that are 8 bits wide, with the next convolutional layer requiring that its inputs are at most 2 bits wide. In the non-encrypted realm, this implies that we need to make use of floating point operations. To make this work with integers as required by FHE, Concrete-ML uses a table lookup (TLU), which is a way to encode univariate functions in FHE. Table lookups are expensive in FHE, and so should only be used when necessary.
The operations done by the activation function of a previous layer and additional re-scaling to the new quantized domain, which are all floating point operations, can be fused to a single TLU. Concrete-ML implements quantized operators that perform this fusion, significantly reducing the number of TLUs necessary to perform inference.
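To illustrate the idea in plain NumPy (this is not Concrete-ML internals, and all parameter values are made up): dequantization, an activation and re-quantization into the next operator's 2-bit domain can be folded into a single table indexed by the 8-bit input:

```python
import numpy as np

# Quantization parameters of the previous layer's 8-bit output (example values).
in_scale, in_zp = 0.05, 128
# Quantization parameters of the next layer's 2-bit input (example values).
out_scale, out_zp = 0.5, 0

# Build one lookup table over all 256 possible 8-bit inputs that fuses:
# dequantize -> ReLU -> requantize to 2 bits.
inputs = np.arange(256)
floats = (inputs - in_zp) * in_scale                  # dequantize
activated = np.maximum(floats, 0.0)                   # activation (ReLU)
table = np.clip(np.round(activated / out_scale) + out_zp, 0, 3).astype(np.int64)

# At inference time, the whole floating point pipeline above becomes a single TLU.
q_in = np.array([10, 130, 200])
q_out = table[q_in]
```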
There are 3 types of operators:

- Operators that perform linear combinations of encrypted and constant (clear) values. For example: matrix multiplication, convolution, addition
- Operators that perform element-wise operations between two encrypted tensors. For example: addition
- Element-wise, fixed-function operators, which can be: addition with a constant, activation functions
The following example shows how to use the _prepare_inputs_with_constants helper function with quantize_actual_values=True to apply the quantization function to the input data of the Gemm operator. Since the quantization function uses floats and a non-linear function (round), a TLU will automatically be generated together with quantization.
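A schematic sketch of the call described above (not verbatim library code: quantized_gemm stands for an already-constructed quantized Gemm operator, q_input for a QuantizedArray holding its input, and the calibrate keyword is an assumption):

```python
# Prepare the operator inputs, quantizing the actual input values; because the
# quantization function uses floats and round(), this generates a TLU.
prepared_inputs = quantized_gemm._prepare_inputs_with_constants(
    q_input,
    calibrate=False,
    quantize_actual_values=True,
)
```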
TLU generation for element-wise operations can be delegated to Concrete-Numpy directly by calling the function's corresponding NumPy implementation, as defined in ops_impl.py.
IntelLabs distiller explanation of quantization: Distiller documentation
Lei Mao's blog on quantization: Quantization for Neural Networks
Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference