In this section, we detail the usage of quantization in Concrete-ML.
Since quantization is necessary to make ML models work in FHE, Concrete-ML implements quantized ML models to facilitate usage, but also exposes some quantization tools. The core of this functionality is the conversion of floating point values to integers, following the techniques described here. We can apply this conversion using `QuantizedArray`, available in `concrete.ml.quantization`.
The `QuantizedArray` class takes several arguments that determine how float values are quantized:

- `n_bits` defines the precision of the quantization
- `values` are the floating point values that will be converted to integers
- `is_signed` determines if the quantized integer values should allow negative values
- `is_symmetric` determines if the range of floating point values to be quantized should be taken as symmetric around zero
Please see the API reference for more information.
We can also use symmetric quantization, where the integer values are centered around 0 and may thus take negative values.
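A minimal sketch of this API, assuming the argument names listed above and that `n_bits` and `values` are the first positional arguments (exact defaults may vary between versions):

```python
import numpy

from concrete.ml.quantization import QuantizedArray

numpy.random.seed(0)
values = numpy.random.uniform(0, 1, size=10)

# Unsigned, asymmetric quantization over 2 bits: integers in [0, 3]
q_arr = QuantizedArray(2, values)

# Signed, symmetric quantization: integers centered around 0
q_sym = QuantizedArray(2, values, is_signed=True, is_symmetric=True)
```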
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
The ML models implemented in Concrete-ML provide features to let the user quantize the input data and dequantize the output data.
Here is a simple example showing how to perform inference, starting from float values and ending up with float values. Note that the FHE engine that is compiled for the ML models does not support data batching.
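A sketch of this flow, using the `quantize_input`, `forward_fhe.encrypt_run_decrypt` and `dequantize_output` methods discussed below (obtaining `quantized_module` by compiling a model is assumed to have been done beforehand):

```python
import numpy

# quantized_module is assumed to come from a compiled Concrete-ML model
x = numpy.array([[0.1, -0.5, 0.3]])  # a single input example (no batching)

# Float -> integer domain
q_x = quantized_module.quantize_input(x)

# Integer computation, executed in FHE
q_y = quantized_module.forward_fhe.encrypt_run_decrypt(q_x)

# Integer -> float domain
y = quantized_module.dequantize_output(q_y)
```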
If we examine the operations done by `quantize_input` and `dequantize_output`, we will see usage of the `QuantizedArray` described above. When the ML model `quantized_module` is calibrated, the min and max of the value distributions are recorded, and these are then applied to quantize/dequantize new data.
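For illustration, the standard formulas for unsigned, asymmetric uniform quantization derive the scale and zero-point from the recorded min/max (a sketch of the technique, not necessarily the exact Concrete-ML internals):

```python
import numpy

def calibrate(values, n_bits):
    """Derive scale and zero-point from the observed value range."""
    vmin, vmax = values.min(), values.max()
    scale = (vmax - vmin) / (2**n_bits - 1)
    zero_point = int(numpy.round(-vmin / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, n_bits):
    """Map floats to integers in [0, 2**n_bits - 1]."""
    qvalues = numpy.round(values / scale) + zero_point
    return numpy.clip(qvalues, 0, 2**n_bits - 1).astype(numpy.int64)
```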
Here, a different usage of `QuantizedArray` is shown, where it is constructed from quantized integer values and the `scale` and `zero_point` are set explicitly from calibration parameters. Once the `QuantizedArray` is constructed, calling `dequant()` will compute the floating point values corresponding to the integer values `qvalues`, which are the output of the `forward_fhe.encrypt_run_decrypt(..)` call.
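A sketch of this usage (the keyword arguments `value_is_float`, `scale` and `zero_point` are assumptions about the constructor; see the API reference for the exact signature):

```python
from concrete.ml.quantization import QuantizedArray

# qvalues: integer output of the forward_fhe.encrypt_run_decrypt(..) call
# scale / zero_point: calibration parameters recorded for the output layer
out_q = QuantizedArray(
    n_bits,
    qvalues,
    value_is_float=False,  # assumed flag: the values are already integers
    scale=scale,
    zero_point=zero_point,
)

# Recover the floating point predictions
result = out_q.dequant()
```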
Intermediary values computed during model inference might need to be re-scaled into the quantized domain of a subsequent model operator. For example, the output of a convolution layer in a neural network might have values that are 7 bits wide, but the next convolutional layer requires that its inputs are, at most, 2 bits wide. In the non-encrypted realm, this implies that we need to make use of floating point operations. In the FHE setting, where we only work with integers, this could be a problem, but, luckily, the FHE implementation behind Concrete-ML provides a solution. We essentially make use of a table lookup, which is later translated into a Programmable Bootstrap (PBS).
Of course, having a PBS for every quantized addition isn't recommended for computational cost reasons. Also, a PBS is currently only allowed for univariate operations (i.e. matrix multiplication can't be done in a PBS). Therefore, our quantized modules split the computation of floating point values and unsigned integers, as is currently done in `concrete.ml.quantization.QuantizedLinear`. Moreover, the operations done by the activation function of a previous layer and the additional re-scaling to the new quantized domain, which are all floating point operations, can be fused into a single TLU. Concrete-ML implements quantized operators that perform this fusion, significantly reducing the number of TLUs necessary to perform inference.
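To illustrate the fusion (a conceptual sketch with made-up scales and zero-points, not the Concrete-ML implementation): dequantization, activation and re-quantization form a single univariate function of the integer input, so they can be tabulated as one TLU:

```python
import numpy

def fused_op(q_in, in_scale, in_zp, out_scale, out_zp, n_bits):
    x = in_scale * (q_in - in_zp)            # dequantize (float)
    y = numpy.maximum(x, 0.0)                # activation, e.g. ReLU (float)
    q = numpy.round(y / out_scale) + out_zp  # re-scale to the new domain
    return int(numpy.clip(q, 0, 2**n_bits - 1))

# fused_op is univariate in q_in, so it can be precomputed as a table
# (the TLU) over all possible 4-bit inputs:
table = [fused_op(v, 0.05, 8, 0.02, 0, 2) for v in range(2**4)]
```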
We can distinguish three types of operators:
- Operators that perform linear combinations of encrypted and constant (clear) values. For example: matrix multiplication, convolution, addition
- Operators that perform element-wise operations between two encrypted tensors. For example: addition
- Element-wise, fixed-function operators, which can be: addition with a constant, activation functions
In the first category, we find operators such as `Gemm`, which will quantize their inputs. Notice that here we use the `_prepare_inputs_with_constants` helper function, with `quantize_actual_values=True`, to apply the quantization function to the input data. The quantization function uses floating point operations and a non-linear function, `round`, and will thus produce a TLU, together with any preceding floating point operations.
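As a conceptual sketch of why such operators can split their computation (this is standard uniform-quantization algebra, not the Concrete-ML code itself), a product of quantized tensors decomposes into an integer-only matrix multiplication, which can run on encrypted values, and a floating point rescaling, which is fused into the following TLU:

```python
import numpy

def quantized_matmul(q_x, x_scale, x_zp, q_w, w_scale, w_zp):
    # Integer-only part: suitable for encrypted computation
    int_acc = (q_x - x_zp) @ (q_w - w_zp)
    # Floating point rescaling: fused into the next TLU
    return (x_scale * w_scale) * int_acc
```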
For element-wise operations with a fixed function, we simply let Concrete-Numpy generate a TLU. To do so, we just need to give this function the corresponding NumPy implementation, which must be defined in `ops_impl.py`.
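For instance, a NumPy implementation of the kind expected there could look like the following (an illustrative sketch; the actual signatures in `ops_impl.py` may differ):

```python
import numpy

def numpy_sigmoid(x):
    """Element-wise sigmoid: a fixed function that can become a TLU."""
    return 1.0 / (1.0 + numpy.exp(-x))
```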