Step-by-step Guide
This guide provides a complete example of converting a PyTorch neural network into its FHE-friendly, quantized counterpart. It focuses on Quantization Aware Training of a simple network on a synthetic dataset.
In general, quantization can be carried out in two different ways: either during Quantization Aware Training (QAT) or after the training phase with Post-Training Quantization (PTQ).
For a formal explanation of the mechanisms that enable FHE-compatible neural networks, please see the following paper.
Baseline PyTorch model
In PyTorch, using standard layers, a fully connected neural network (FCNN) would look like this:
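A minimal sketch of such a network is given below, assuming a small classification task; the class name SimpleNet, the input width, and the number of outputs are illustrative, while the number of hidden neurons is kept configurable as in the experiments described next.

```python
import torch
import torch.nn as nn

IN_FEAT = 12   # illustrative input width (assumption)
OUT_FEAT = 2   # illustrative number of classes (assumption)

class SimpleNet(nn.Module):
    """Fully connected network with two hidden layers, in plain PyTorch."""

    def __init__(self, n_hidden: int = 30):
        super().__init__()
        self.fc1 = nn.Linear(IN_FEAT, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.fc3 = nn.Linear(n_hidden, OUT_FEAT)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```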
The network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bit weights and activations. The accumulator size shown below is the mean over 10 runs of the experiment. A mean accumulator size of 6.6 means that in 4 runs out of 10 the measured accumulator was 6 bits wide, while in the other 6 it was 7 bits wide.
This shows that the fp32 accuracy and the accumulator size increase with the number of hidden neurons, while the 3-bit accuracy remains low irrespective of the number of neurons. While all the configurations tried here were FHE-compatible (accumulator < 16 bits), a lower accumulator size is often preferable in order to speed up inference.
Accumulator size is determined by Concrete as the maximum bit-width encountered anywhere in the encrypted circuit.
Quantization Aware Training
Brevitas provides a quantized version of almost all PyTorch layers (a Linear layer becomes QuantLinear, a ReLU layer becomes QuantReLU, and so on), plus some extra quantization parameters, illustrated in the short example after this list:

- bit_width: precision quantization bits for activations
- act_quant: quantization protocol for the activations
- weight_bit_width: precision quantization bits for weights
- weight_quant: quantization protocol for the weights
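As a sketch of how these parameters are passed, the snippet below declares one quantized linear layer and one quantized activation; the 3-bit setting and the choice of Brevitas's default quantizer classes (Int8WeightPerTensorFloat and Uint8ActPerTensorFloat) are illustrative assumptions rather than required values.

```python
import brevitas.nn as qnn
from brevitas.quant import Int8WeightPerTensorFloat, Uint8ActPerTensorFloat

# 3-bit weights, with an explicit (default) per-tensor weight quantizer
fc = qnn.QuantLinear(
    10, 20, bias=True,
    weight_bit_width=3, weight_quant=Int8WeightPerTensorFloat,
)
# 3-bit unsigned activations, with an explicit (default) activation quantizer
act = qnn.QuantReLU(bit_width=3, act_quant=Uint8ActPerTensorFloat)
```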
In order to use FHE, the network must be quantized from end to end. Thanks to Brevitas's QuantIdentity layer, the input can be quantized by placing such a layer at the entry point of the network. Moreover, it is possible to combine PyTorch and Brevitas layers, provided that a QuantIdentity is placed after each such PyTorch layer. The following table gives the replacements to be made to convert a PyTorch NN for Concrete ML compatibility.
Some PyTorch operators (from the PyTorch functional API) require a brevitas.nn.QuantIdentity to be applied to their inputs.
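As a hedged illustration, the block below uses the functional torch.cat on quantized tensors, with a shared QuantIdentity quantizing both of its inputs; the class name, bit-width, and concatenation dimension are assumptions made for the example.

```python
import torch
import brevitas.nn as qnn

class QuantConcat(torch.nn.Module):
    """Illustrative block: a functional torch.cat whose inputs are quantized first."""

    def __init__(self, n_bits: int = 3):
        super().__init__()
        # A single QuantIdentity shared by both inputs of the concatenation
        self.quant = qnn.QuantIdentity(bit_width=n_bits)

    def forward(self, a, b):
        # torch.cat is a functional operator: quantize its inputs beforehand
        return torch.cat([self.quant(a), self.quant(b)], dim=1)
```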
The QAT import tool in Concrete ML is a work in progress. While it has been tested with some networks built with Brevitas, it is possible to use other tools to obtain QAT networks.
With Brevitas, the network above becomes:
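The class below is a hedged sketch of this Brevitas version, mirroring the SimpleNet sketch above and using the name QuantSimpleNet referenced in the results further down; the 3-bit setting and the reliance on Brevitas's default quantizers are assumptions.

```python
import torch
import brevitas.nn as qnn

IN_FEAT = 12   # illustrative input width, matching the sketch above
OUT_FEAT = 2   # illustrative number of classes
N_BITS = 3     # 3-bit weights and activations, as in the experiments above

class QuantSimpleNet(torch.nn.Module):
    """Quantization Aware Training version of SimpleNet, built with Brevitas layers."""

    def __init__(self, n_hidden: int = 100):
        super().__init__()
        # Quantize the input of the network
        self.quant_input = qnn.QuantIdentity(bit_width=N_BITS)
        # Linear layers with quantized weights; biases are kept but not quantized
        self.fc1 = qnn.QuantLinear(IN_FEAT, n_hidden,
                                   weight_bit_width=N_BITS, bias=True, bias_quant=None)
        self.relu1 = qnn.QuantReLU(bit_width=N_BITS)
        self.fc2 = qnn.QuantLinear(n_hidden, n_hidden,
                                   weight_bit_width=N_BITS, bias=True, bias_quant=None)
        self.relu2 = qnn.QuantReLU(bit_width=N_BITS)
        self.fc3 = qnn.QuantLinear(n_hidden, OUT_FEAT,
                                   weight_bit_width=N_BITS, bias=True, bias_quant=None)

    def forward(self, x):
        x = self.quant_input(x)
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        return self.fc3(x)
```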
In the network above, biases are used for the linear layers but are not quantized (bias=True, bias_quant=None). The addition of the bias is a univariate operation and is fused into the activation function.
Training this network with pruning (see below), keeping 30 non-zero neurons out of 100 in the hidden layers, gives good accuracy while keeping the accumulator size low.
The PyTorch QAT training loop is the same as the standard floating point training loop, but hyper-parameters such as learning rate might need to be adjusted.
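For reference, a standard loop such as the sketch below can be applied unchanged to the quantized model; the optimizer, learning rate, number of epochs, and the data_loader yielding (input, label) batches are placeholders.

```python
import torch

def train(model, data_loader, n_epochs: int = 10, lr: float = 1e-3):
    """Standard floating-point training loop, applied unchanged to the QAT model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(n_epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```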
Quantization Aware Training is somewhat slower than normal training. QAT introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs as its computational intensity is low with respect to data transfer time.
Pruning using Torch
Considering that FHE only works with limited integer precision, there is a risk of the accumulator overflowing, which makes Concrete ML raise an error. Pruning forces a proportion of the weights to zero during training, which limits the number of non-zero terms in each accumulated sum and thus keeps the accumulator bit-width low.
The following code shows how to use pruning in the previous example:
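The sketch below builds on the QuantSimpleNet sketch above and uses torch.nn.utils.prune; keeping at most 30 non-zero weights per output neuron matches the experiment described earlier, and the method names prune and unprune are illustrative.

```python
import torch.nn as nn
import torch.nn.utils.prune as pruning

class PrunedQuantNet(QuantSimpleNet):
    """QuantSimpleNet with unstructured L1 pruning on its linear layers."""

    def prune(self, max_non_zero: int = 30):
        """Keep at most `max_non_zero` non-zero weights per output neuron."""
        for layer in self.modules():
            # Brevitas QuantLinear inherits from nn.Linear, so this matches all linear layers
            if isinstance(layer, nn.Linear):
                # Weights have shape (out_features, in_features)
                n_zero_weights = (layer.weight.shape[1] - max_non_zero) * layer.weight.shape[0]
                if n_zero_weights <= 0:
                    continue
                # Zero out the smallest-magnitude weights (kept as a re-parametrization)
                pruning.l1_unstructured(layer, name="weight", amount=n_zero_weights)

    def unprune(self):
        """Make the pruning permanent, e.g. before compiling the network."""
        for layer in self.modules():
            if isinstance(layer, nn.Linear) and pruning.is_pruned(layer):
                pruning.remove(layer, name="weight")
```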
Results for PrunedQuantNet, a pruned version of QuantSimpleNet with 100 neurons in the hidden layers, are given below, showing the mean accumulator size measured over 10 runs of the experiment:
This shows that the fp32 accuracy has been improved while maintaining a constant mean accumulator size.
When pruning a larger neural network during training, it is easier to obtain a low bit-width accumulator while maintaining better final accuracy. Thus, pruning is more robust than training a similar, smaller network.