Quantization tools
Quantizing data
Concrete-ML supports quantized ML models and also provides quantization tools for Quantization Aware Training and Post-Training Quantization. The core of this functionality is the conversion of floating point values to integers and back. This is done using QuantizedArray in concrete.ml.quantization.
The QuantizedArray class takes several arguments that determine how float values are quantized:

n_bits: defines the precision of the quantization
values: the floating point values that will be converted to integers
is_signed: determines if the quantized integer values should allow negative values
is_symmetric: determines if the range of floating point values to be quantized should be taken as symmetric around zero
See also the UniformQuantizer reference for more information:
from concrete.ml.quantization import QuantizedArray
import numpy
numpy.random.seed(0)
A = numpy.random.uniform(-2, 2, 10)
print("A = ", A)
# array([ 0.19525402, 0.86075747, 0.4110535, 0.17953273, -0.3053808,
# 0.58357645, -0.24965115, 1.567092 , 1.85465104, -0.46623392])
q_A = QuantizedArray(7, A)
print("q_A.qvalues = ", q_A.qvalues)
# array([ 37, 73, 48, 36, 9,
# 58, 12, 112, 127, 0])
# the quantized integers values from A.
print("q_A.quantizer.scale = ", q_A.quantizer.scale)
# 0.018274684777173276, the scale S.
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# 26, the zero point Z.
print("q_A.dequant() = ", q_A.dequant())
# array([ 0.20102153, 0.85891018, 0.40204307, 0.18274685, -0.31066964,
# 0.58478991, -0.25584559, 1.57162289, 1.84574316, -0.4751418 ])
# Dequantized values.
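De-quantization computes S * (q - Z) from the scale and zero point. Continuing the example above, this can be checked directly against dequant(), using only the attributes already shown:
print(q_A.quantizer.scale * (q_A.qvalues - q_A.quantizer.zero_point))
# array([ 0.20102153, 0.85891018, ...]), identical to q_A.dequant()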
It is also possible to use symmetric quantization, where the integer values are centered around 0:
q_A = QuantizedArray(3, A)
print("Unsigned: q_A.qvalues = ", q_A.qvalues)
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# Unsigned: q_A.qvalues = [2 4 2 2 0 3 0 6 7 0]
# q_A.quantizer.zero_point = 1
q_A = QuantizedArray(3, A, is_signed=True, is_symmetric=True)
print("Signed Symmetric: q_A.qvalues = ", q_A.qvalues)
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# Signed Symmetric: q_A.qvalues = [ 0 1 1 0 0 1 0 3 3 -1]
# q_A.quantizer.zero_point = 0
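In the signed, symmetric case, the zero point is 0 and the scale is derived from the maximum absolute value of the data. The following sketch reproduces the quantized values above; the formula is inferred from the outputs shown here, not a reference to Concrete-ML internals:
n_bits = 3
scale = numpy.abs(A).max() / (2 ** (n_bits - 1) - 1)  # approx. 0.6182
print(numpy.round(A / scale).astype(numpy.int64))
# [ 0  1  1  0  0  1  0  3  3 -1], matching q_A.qvalues above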
In the following example, showing the de-quantization of model outputs, the QuantizedArray class is used in a different way. Here, it uses pre-quantized integer values and has the scale and zero-point set explicitly. Once the QuantizedArray is constructed, calling dequant() will compute the floating point values corresponding to the integer values qvalues, which are the output of the forward_fhe.encrypt_run_decrypt(..) call.
import numpy
from concrete.ml.quantization import QuantizedArray

def dequantize_output(self, qvalues: numpy.ndarray) -> numpy.ndarray:
    # .....
    # Assume: qvalues is the decrypted integer output of the model
    # .....
    return QuantizedArray(
        output_layer.n_bits,
        qvalues,
        value_is_float=False,
        scale=output_layer.output_scale,
        zero_point=output_layer.output_zero_point,
    ).dequant()
    # ....
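Putting the pieces together, a typical encrypted inference follows a quantize / run / de-quantize pattern. The sketch below assumes a quantized_module obtained from compilation and a floating point input array x; quantize_input and dequantize_output are the QuantizedModule helpers for the float/integer conversions:
q_x = quantized_module.quantize_input(x)  # float inputs -> integers
q_y = quantized_module.forward_fhe.encrypt_run_decrypt(q_x)  # FHE execution
y = quantized_module.dequantize_output(q_y)  # integers -> float outputs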
Quantized modules
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
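For example, for a quantized matrix product, the integer accumulator must be re-scaled into the output quantization domain. The sketch below illustrates the principle, following the integer-only inference scheme of the Google paper cited in the Resources; all names are local to the example, not Concrete-ML internals:
import numpy

def requantize_matmul(q_x, q_w, s_x, z_x, s_w, z_w, s_out, z_out):
    # Integer accumulation on zero-point-corrected operands
    acc = (q_x.astype(numpy.int64) - z_x) @ (q_w.astype(numpy.int64) - z_w)
    # Re-scale the accumulator into the output quantization domain
    return numpy.round((s_x * s_w / s_out) * acc).astype(numpy.int64) + z_out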
In Concrete-ML, the quantized equivalent of a scikit-learn model or a PyTorch nn.Module is the QuantizedModule. Note that only inference is implemented in the QuantizedModule, and it is built through a conversion of the inference function of the corresponding scikit-learn or PyTorch module.

Built-in neural networks expose the quantized_module member, while a QuantizedModule is also the result of the compilation of custom models through compile_torch_model and compile_brevitas_qat_model.
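For custom models, compilation returns the QuantizedModule directly. A minimal sketch, assuming a small PyTorch model and a random calibration input set (the n_bits argument and exact signature may vary across Concrete-ML versions):
import torch
import numpy
from concrete.ml.torch.compile import compile_torch_model

model = torch.nn.Sequential(
    torch.nn.Linear(10, 4), torch.nn.ReLU(), torch.nn.Linear(4, 2)
)
input_set = numpy.random.uniform(-1, 1, size=(100, 10))
quantized_module = compile_torch_model(model, input_set, n_bits=3)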
The quantized versions of floating point model operations are stored in the QuantizedModule. The ONNX_OPS_TO_QUANTIZED_IMPL dictionary maps ONNX floating point operators (e.g. Gemm) to their quantized equivalents (e.g. QuantizedGemm). For more information on implementing these operations, please see the FHE compatible op-graph section.
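The supported operators can be listed by iterating over this dictionary. The import path below is an assumption and may differ between Concrete-ML versions; check the API reference:
# Assumed import path -- verify against your version's API reference
from concrete.ml.quantization.post_training import ONNX_OPS_TO_QUANTIZED_IMPL

for onnx_op, quantized_op in ONNX_OPS_TO_QUANTIZED_IMPL.items():
    print(onnx_op, "->", quantized_op.__name__)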
The computation graph is taken from the corresponding floating point ONNX graph exported from scikit-learn using HummingBird, or from the ONNX graph exported by PyTorch. Calibration is used to obtain quantized parameters for the operations in the QuantizedModule. Parameters are also determined for the quantization of inputs during model deployment.
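Conceptually, calibration for a single tensor amounts to recording the range of values observed on representative data and deriving the scale and zero point from it. A minimal sketch in plain numpy, not the Concrete-ML calibration code:
import numpy

def calibrate(values: numpy.ndarray, n_bits: int):
    # Asymmetric quantization parameters from the observed min/max range
    v_min, v_max = values.min(), values.max()
    scale = (v_max - v_min) / (2**n_bits - 1)
    zero_point = int(numpy.round(-v_min / scale))
    return scale, zero_point

# Applied to the array A from the first example with n_bits=7, this
# reproduces scale ~= 0.018274 and zero_point = 26.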
The QuantizedModule generates the Concrete-Numpy function that is compiled to FHE. The compilation will succeed if the intermediate values conform to the 8-bit precision limit of the Concrete stack. See the compilation section for details.
Resources
Lei Mao's blog on quantization: Quantization for Neural Networks
Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference