Advanced features
Concrete ML provides features for advanced users to adjust the cryptographic parameters generated by the Concrete stack. This allows users to identify the best trade-off between latency and model accuracy for their specific machine learning models.
Concrete ML makes use of table lookups (TLUs) to represent any non-linear operation (e.g., a sigmoid). TLUs are implemented through the Programmable Bootstrapping (PBS) operation, which applies a non-linear operation in the cryptographic realm.
The result of TLU operations is obtained with a specific tolerance to off-by-one errors. Concrete ML offers the possibility to set the probability of such errors occurring, which influences the cryptographic parameters. The lower the tolerance, the more restrictive the parameters become, making both key generation and, more significantly, FHE execution time slower.
Concrete ML has a simulation mode where the impact of approximate computation of TLUs on the model accuracy can be determined. The simulation is much faster, speeding up model development significantly. The behavior in simulation mode is representative of the behavior of the model on encrypted data.
In Concrete ML, there are three different ways to define the tolerance to off-by-one errors for each TLU operation:
setting p_error, the error probability of an individual TLU
setting global_p_error, the error probability of the full circuit
setting neither p_error nor global_p_error, and using the default parameters
p_error and global_p_error cannot be set at the same time, as they are incompatible with each other.
The first way to set error probabilities in Concrete ML is at the local level, by directly setting the tolerance to error of each individual TLU operation (such as activation functions for a neuron output). This tolerance is referred to as p_error. A given PBS operation has a probability of 1 - p_error of being evaluated correctly. Successful evaluation here means that the value decrypted after FHE evaluation is exactly the same as the one that would be computed in the clear. Otherwise, off-by-one errors might occur but, in practice, these errors are not necessarily problematic if they are sufficiently rare.
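To make the meaning of p_error concrete, here is a minimal sketch in plain Python. It is a toy model, not the noise model used by Concrete: it assumes that a faulty TLU evaluation shifts the looked-up value by exactly +1 or -1 with equal probability. The function noisy_tlu and the table values are hypothetical.

```python
import random


def noisy_tlu(table, index, p_error, rng):
    """Return table[index], shifted by +1 or -1 with probability p_error."""
    exact = table[index]
    if rng.random() < p_error:
        return exact + rng.choice((-1, 1))
    return exact


# Hypothetical 3-bit lookup table standing in for a quantized activation.
table = [0, 0, 1, 2, 4, 6, 7, 7]
p_error = 0.1
n_trials = 10_000
rng = random.Random(0)  # fixed seed so the run is reproducible

# Difference between the noisy lookup and the exact one, for each trial.
errors = [
    noisy_tlu(table, i % len(table), p_error, rng) - table[i % len(table)]
    for i in range(n_trials)
]
error_rate = sum(e != 0 for e in errors) / n_trials
print(f"observed error rate: {error_rate:.3f} for p_error = {p_error}")
```

In this toy model, every error has magnitude one and the observed error rate stays close to the requested p_error, which is why rare errors often leave the final prediction unchanged.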
For simplicity, it is best to use default options, irrespective of the type of model. However, the default values may be too pessimistic, especially for deep neural networks, reducing computation speed without any improvement in accuracy. For deep neural networks, some TLU errors might not affect the accuracy of the network, so p_error can be safely increased (e.g., see CIFAR classifications in our showcase).
Here is a visualization of the effect of p_error on a neural network model, with p_error = 0.1 compared to execution in the clear (i.e., no error):
Varying p_error in the one hidden-layer neural network above changes the inference time, as reported in the p_error / Inference Time (ms) table of this page. Increasing p_error to 0.1 halves the inference time with respect to a p_error of 0.001. In the graph above, the decision boundary becomes noisier with a higher p_error.
The speedup depends on model complexity but, with an iterative approach, it is possible to search for a good value of p_error that yields a speedup while maintaining good accuracy. Concrete ML provides a tool to find a good value for p_error based on binary search.
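Users can change p_error by passing it as an argument to the compile function of any of the models. For example, for a fitted built-in model named model and a representative input set X_train (both names are illustrative), the call is model.compile(X_train, p_error=0.1).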
If the p_error value is specified and simulation is enabled, the run will take into account the randomness induced by the choice of p_error. This results in statistical similarity to the FHE evaluation.
A global_p_error is also available and defines the error probability of the entire model: the output of the model is exactly the same as in the clear with probability 1 - global_p_error. In this case, the p_error for every TLU is determined internally in Concrete such that the global_p_error is reached for the whole model.
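The link between the two parameters can be illustrated with a short calculation. The sketch below assumes that TLU errors are independent and that every TLU shares the same p_error. This is a simplifying assumption made for illustration, and the rule applied internally by the Concrete optimizer may differ. The function names and the number of TLUs are hypothetical.

```python
def global_error_from_p_error(p_error, n_tlus):
    """Probability that at least one of n_tlus independent TLUs is wrong."""
    return 1.0 - (1.0 - p_error) ** n_tlus


def p_error_from_global_error(global_p_error, n_tlus):
    """Per-TLU error probability that yields global_p_error over n_tlus TLUs."""
    return 1.0 - (1.0 - global_p_error) ** (1.0 / n_tlus)


n_tlus = 100  # hypothetical number of TLUs in the circuit

# The circuit-level error grows quickly with the per-TLU error.
for p_error in (0.001, 0.01, 0.1):
    print(p_error, global_error_from_p_error(p_error, n_tlus))

# Per-TLU error needed to reach a circuit-level error of 0.1.
print(p_error_from_global_error(0.1, n_tlus))
```

Under this assumption, a fixed global_p_error forces a smaller per-TLU error as the number of TLUs grows, which is consistent with deeper circuits requiring more restrictive parameters.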
There might be cases where the user encounters a No cryptography parameter found error message. Increasing the p_error or the global_p_error in this case might help.
Usage is similar to the p_error parameter: global_p_error is passed to the compile function, for example as global_p_error=0.1 for an XGBoost classifier. With this setting, the classifier in FHE has a 1/10 probability of producing an output value that is off by one compared to the expected value. The shift is relative to the expected value, so even if the result is different, it should be close to the expected value.
If neither p_error nor global_p_error is set, Concrete ML employs p_error = 2^-40 by default.
Currently, finding a good p_error value a priori is not possible, as it is difficult to determine the impact of the TLU error on the output of a neural network. Concrete ML provides a tool to find a good p_error value that improves inference speed while maintaining accuracy. The method is based on binary search and evaluates the latency/accuracy trade-off iteratively.
With this optimal p_error, accuracy is maintained while execution time is improved by a factor of 1.51.
Please note that the default setting for the search interval is restricted to a range of 0.0 to 0.9. Increasing the upper bound beyond this range may result in longer execution times, especially when p_error≈1.
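The idea of a binary search over p_error within the interval 0.0 to 0.9 can be sketched in plain Python as follows. This is not the Concrete ML tool itself: search_p_error, toy_accuracy and all numeric values are hypothetical, and the sketch assumes that accuracy never increases when p_error grows. In practice, the accuracy callable would evaluate the compiled model in simulation, and its result would be noisy.

```python
def search_p_error(evaluate_accuracy, reference_accuracy, max_drop=0.01,
                   lower=0.0, upper=0.9, n_iterations=20):
    """Return the largest p_error found whose accuracy drop is within max_drop."""
    best = lower
    for _ in range(n_iterations):
        candidate = (lower + upper) / 2
        drop = reference_accuracy - evaluate_accuracy(candidate)
        if drop <= max_drop:
            # Accuracy is still acceptable: try a larger (faster) p_error.
            best = candidate
            lower = candidate
        else:
            # Too much accuracy lost: try a smaller p_error.
            upper = candidate
    return best


def toy_accuracy(p_error):
    """Hypothetical accuracy curve: flat up to 0.05, then linearly degrading."""
    return 0.95 - max(0.0, p_error - 0.05) * 0.5


best_p_error = search_p_error(toy_accuracy, reference_accuracy=0.95)
print(f"largest acceptable p_error: {best_p_error:.4f}")
```

Each iteration halves the search interval, so a handful of evaluations is enough to locate the point where the accuracy drop reaches the tolerated limit.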
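The rounding operation is defined as follows, where $L$ is the bit-width of the accumulator and $P \leq L$ is the desired input bit-width of the TLU. First, the number of bits to remove is computed:

$$t = L - P$$

Then, the rounding operation can be computed as:

$$x_{\text{rounded}} = \left\lfloor \frac{x}{2^t} \right\rceil \cdot 2^t$$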
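As an illustration, rounding an accumulator so that only its most significant bits carry information can be sketched with integer arithmetic. The sketch assumes the operation divides by 2^t, rounds to the nearest integer and multiplies back by 2^t, where t is the accumulator bit-width minus the target bit-width. The function name round_to_p_bits is hypothetical, and ties are rounded upwards here, which may differ from the tie-breaking rule used by Concrete ML.

```python
def round_to_p_bits(x, acc_bits, target_bits):
    """Round integer x to the nearest multiple of 2**(acc_bits - target_bits)."""
    t = acc_bits - target_bits
    if t <= 0:
        # The accumulator already fits in the target bit-width.
        return x
    # Adding half of 2**t before shifting rounds to nearest (ties go up).
    return ((x + (1 << (t - 1))) >> t) << t


# An 11-bit accumulator reduced to 6 bits: t = 5, so multiples of 32.
print(round_to_p_bits(1000, acc_bits=11, target_bits=6))  # 1000 / 32 = 31.25 -> 31 * 32 = 992
# A 7-bit accumulator reduced to 4 bits: t = 3, so multiples of 8.
print(round_to_p_bits(99, acc_bits=7, target_bits=4))     # 99 / 8 = 12.375 -> 12 * 8 = 96
```

After rounding, the t least significant bits are always zero, so the following TLU only needs to distinguish the remaining bits.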
In Concrete ML, this feature is currently implemented for custom neural networks through the compile functions, including concrete.ml.torch.compile_torch_model, concrete.ml.torch.compile_onnx_model and concrete.ml.torch.compile_brevitas_qat_model.
The rounding_threshold_bits argument can be set to a specific bit-width. It is important to choose an appropriate bit-width threshold to balance the trade-off between speed and accuracy. By reducing the bit-width of intermediate tensors, it is possible to speed up computations while maintaining accuracy.
To find the best trade-off between speed and accuracy, it is recommended to experiment with different thresholds and check the accuracy on an evaluation set after compiling the model.
In practice, the process looks like this:
1. Set rounding_threshold_bits to a relatively high P, say 8 bits.
2. Check the accuracy.
3. Update P = P - 1.
4. Repeat steps 2 and 3 until the accuracy loss exceeds an acceptable threshold.
An example of such an implementation is available in evaluate_torch_cml.py and CifarInFheWithSmallerAccumulators.ipynb.
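The iterative procedure above can be sketched as a simple loop. This is a minimal illustration and not the code of the referenced examples: find_rounding_threshold is a hypothetical helper, and evaluate_accuracy stands for a user-supplied callable that would compile the model with the given rounding_threshold_bits and measure accuracy on an evaluation set. The accuracy values below are made up for the demonstration.

```python
def find_rounding_threshold(evaluate_accuracy, baseline_accuracy,
                            max_loss=0.01, start_bits=8, min_bits=1):
    """Return the smallest bit-width whose accuracy loss stays within max_loss."""
    best_bits = None
    for bits in range(start_bits, min_bits - 1, -1):
        loss = baseline_accuracy - evaluate_accuracy(bits)
        if loss > max_loss:
            # Accuracy dropped too much: keep the previous bit-width.
            break
        best_bits = bits
    return best_bits


# Hypothetical accuracy measured for each rounding threshold.
toy_accuracies = {8: 0.90, 7: 0.90, 6: 0.895, 5: 0.88, 4: 0.80, 3: 0.60, 2: 0.40, 1: 0.20}

best_bits = find_rounding_threshold(toy_accuracies.get, baseline_accuracy=0.90)
print(f"selected rounding_threshold_bits: {best_bits}")
```

The loop returns None when even the starting bit-width loses too much accuracy, which signals that rounding should start from a higher threshold or be disabled.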
By using verbose = True and show_mlir = True during compilation, the user receives a lot of information from Concrete. These options are, however, mainly meant for power users, so their output may be hard to understand.
Here, one will see:
the computation graph (typically):
the MLIR, produced by Concrete:
information from the optimizer (including cryptographic parameters):
In the optimizer feedback, the following information is provided:
The bit-width ("6-bit integers") used in the program: for the moment, the compiler only supports a single precision (i.e., all PBS are promoted to the same bit-width, the largest one). Therefore, this bit-width predominantly drives the speed of the program, and it is essential to reduce it as much as possible for faster execution.
The maximal norm2 ("7 manp"), which has an impact on the crypto parameters: The larger this norm2, the slower PBS will be. The norm2 is related to the norm of some constants appearing in your program, in a way which will be clarified in the Concrete documentation.
The probability of error of an individual PBS, which was requested by the user ("3.300000e-02 error per pbs call" in User Config).
The probability of error of the full circuit, which was requested by the user ("1.000000e+00 error per circuit call" in User Config). Here, the probability 1 stands for "not used", since we had set the individual probability via p_error.
The probability of error of an individual PBS, which is found by the optimizer ("1/30 errors (3.234529e-02)").
The probability of error of the full circuit which is found by the optimizer ("1/10 errors (9.390887e-02)").
An estimation of the cost of the circuit ("4.214000e+02 Millions Operations"): Large values indicate a circuit that will execute more slowly.
Here is some further information about cryptographic parameters:
1x glwe_dimension
2**11 polynomial (2048)
762 lwe dimension
keyswitch l,b=5,3
blindrota l,b=2,15
wopPbs : false
This optimizer feedback is a work in progress and will be modified and improved in future releases.
p_error | Inference Time (ms) |
---|---|
0.001 | 0.80 |
0.01 | 0.41 |
0.1 | 0.37 |
To speed up neural networks, a rounding operator can be applied on the accumulators of linear and convolution layers to retain the most significant bits, on which the activation and quantization are applied. The accumulator is represented using $L$ bits, and $P \leq L$ is the desired input bit-width of the TLU operation that computes the activation and quantization.
First, compute $t$ as the difference between $L$, the actual bit-width of the accumulator, and $P$: $t = L - P$.
In the rounding operation, $x$ is the input number and $\lfloor \cdot \rceil$ denotes the operation that rounds to the nearest integer.
The rounding_threshold_bits parameter only works in FHE for a TLU input bit-width ($P$) less than or equal to 8 bits.