Concrete-ML provides simple neural network models with a Scikit-learn interface through the NeuralNetClassifier and NeuralNetRegressor classes. The neural network models are built with Skorch, which provides a scikit-learn-like interface to Torch models (more here). Currently, only linear layers are supported, but the number of layers, the activation function and the number of neurons in each layer are configurable. This approach is similar to what is available in Scikit-learn using the MLPClassifier/MLPRegressor classes. The built-in fully connected neural network (FCNN) models train easily with a single call to .fit(), which will automatically quantize the weights and activations.
While NeuralNetClassifier and NeuralNetRegressor provide scikit-learn-like models, their architecture is somewhat restricted in order to make training easy and robust. If you need more advanced models, you can convert custom neural networks, as described in the FHE-friendly models documentation.
To create an instance of a Fully Connected Neural Network, you need to instantiate one of the NeuralNetClassifier or NeuralNetRegressor classes and configure a number of parameters that are passed to their constructor. Note that some parameters need to be prefixed by module__, while others don't: the parameters related to the model, i.e. the underlying nn.Module, must have the prefix, while the parameters related to training options do not. The main parameters are:
- module__n_layers: number of layers in the FCNN; must be at least 1
- module__n_outputs: number of outputs (classes or targets)
- module__input_dim: dimensionality of the input
- module__activation_function: can be one of the Torch activations (e.g. nn.ReLU, see the full list here)
- n_w_bits (default 3): number of bits for weights
- n_a_bits (default 3): number of bits for activations and inputs
- n_accum_bits (default 8): maximum desired accumulator bit width. The implementation will attempt to keep accumulators under this bitwidth through pruning, i.e. by setting some weights to zero
- max_epochs: number of epochs to train the network (default 10)
- verbose: whether to log loss/metrics during training (default: False)
- lr: learning rate (default 0.001)

Other parameters from Skorch can be found in the Skorch documentation.

- module__n_hidden_neurons_multiplier: the number of hidden neurons is automatically set proportional to the dimensionality of the input (i.e. the value of module__input_dim). This parameter controls the proportionality factor and is set to 4 by default. This value gives good accuracy while avoiding accumulator overflow.
When you have training data in the form of a Numpy array, and targets in a Numpy 1d array, you can set:
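For instance, a minimal sketch (the import path and default values may differ between Concrete-ML versions):

```python
import numpy as np
from torch import nn
from concrete.ml.sklearn import NeuralNetClassifier  # import path may vary between versions

params = {
    "module__n_layers": 2,
    "module__n_outputs": 2,
    "module__input_dim": 10,
    "module__activation_function": nn.ReLU,
    "n_w_bits": 3,
    "n_a_bits": 3,
    "n_accum_bits": 8,
    "max_epochs": 10,
}

# Skorch expects float32 features and int64 targets
X = np.random.rand(100, 10).astype(np.float32)
y = np.random.randint(0, 2, size=100).astype(np.int64)

model = NeuralNetClassifier(**params)
model.fit(X, y)
```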
You can give weights to each class, to use in training. Note that this must be supported by the underlying torch loss function.
The n_hidden_neurons_multiplier parameter influences training accuracy, as it controls the number of non-zero neurons allowed in each layer. Increasing n_hidden_neurons_multiplier improves accuracy, but precision limitations should be taken into account to avoid overflow in the accumulator. The default value is a good compromise that avoids overflow in most cases, but you may want to lower it to reduce the breadth of the network if you encounter overflow errors. A value of 1 should be completely safe with respect to overflow.
Installing Concrete-ML using PyPI requires a Linux-based OS or macOS running on an x86 CPU. For Apple Silicon, Docker is the only currently supported option (see below).
Installing on Windows can be done using Docker or WSL. On WSL, Concrete-ML will work as long as the package is not installed in the /mnt/c/ directory, which corresponds to the host OS filesystem.
To install Concrete-ML from PyPI, run the following:
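For example:

```shell
pip install concrete-ml
```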
Concrete-ML can be installed using Docker by either pulling the latest image or a specific version:
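For example, assuming the image is published as zamafhe/concrete-ml on Docker Hub (image name and tags are assumptions):

```shell
# latest version
docker pull zamafhe/concrete-ml:latest

# or a specific version, e.g. v0.2.0 (tag assumed)
docker pull zamafhe/concrete-ml:v0.2.0
```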
The image can be used with Docker volumes, see the Docker documentation here.
The image can then be used via the following command:
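A typical invocation could look like this (image name and Jupyter port are assumptions):

```shell
docker run --rm -it -p 8888:8888 -v /host/path:/data zamafhe/concrete-ml
```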
This will launch a Concrete-ML enabled Jupyter server in Docker that can be accessed directly from a browser.
Alternatively, a shell can be launched in Docker, with or without volumes:
Concrete-ML is built on top of Concrete-Numpy, which enables Numpy programs to be converted into FHE circuits.
The lifecycle of a Concrete ML program is as follows:
Training. A model is trained using plaintext inputs.
Quantization. The trained model is then converted into an integer equivalent using quantization, which can happen either during training (Quantization-Aware Training) or after training (Post-Training Quantization).
Compilation. Once the model is quantized, it is compiled using Concrete's FHE compiler to produce an equivalent FHE circuit. This circuit is represented as an MLIR program consisting of low level cryptographic operations. You can read more about FHE compilation here, MLIR here and about the low-level Concrete library here.
Inference. The compiled model can then be deployed to a server and used to run private inference on encrypted inputs. You can see some examples here.
Here is an example for a simple linear regression model:
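A minimal sketch of such a flow, assuming the built-in LinearRegression model and its compile/predict methods (argument names are assumptions to verify against the API reference):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Training: done in the clear, quantization happens under the hood
model = LinearRegression(n_bits=8)
model.fit(X_train, y_train)

# Compilation: produce an FHE circuit from a representative inputset
model.compile(X_train)

# Inference: run encrypted prediction on a few test samples
y_pred_fhe = model.predict(X_test[:3], execute_in_fhe=True)
```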
At this stage, we have everything we need to deploy the model using Client
and Server
from concrete.numpy
. Please refer to the Concrete-Numpy implementation for more information on the deployment.
The current version of Concrete only supports up to 8-bit integers. This means that any floating point or large precision integer model will need to be converted to an 8-bit equivalent to be able to work with FHE. In most cases, this will require both quantization and pruning.
If you try to compile a program using more than 8 bits, the compiler will throw an error, as shown in this example:
Compiler output:
Notice that the maximum bit width, determined by the compiler, depends on the inputset passed to the compile_on_inputset
function. In this case, the error is caused by the input value in the inputset that produces a result whose representation requires 9 bits. This input is the value 8, since 8 * 42 = 336, which is a 9-bit value.
You can determine the number of bits necessary to represent an integer value with the formula:
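For an unsigned integer $x \geq 1$:

$$\mathsf{n\_bits}(x) = \lfloor \log_2(x) \rfloor + 1$$

For instance, $\lfloor \log_2(336) \rfloor + 1 = 9$, which matches the example above.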
For a more practical example, the MNIST classification task consists of taking an image, a 28x28 array containing uint8 values representing a handwritten digit, and predicting whether it belongs to one of 10 classes: the digits from 0 to 9. The output is a one-hot vector which indicates the class to which a particular sample belongs.
The input contains 28x28x8 bits, so 6272 bits of information. In practice, you could still obtain good results on MNIST by thresholding the pixels to {0, 1} and training a model for this new binarized MNIST task. This means that in a real use case where you actually need to perform digit recognition, you could binarize your input on the fly, replacing each pixel with either 0 or 1. In doing so, you use 1 bit per pixel and now only have 784 bits of input data. It also means that if you are doing some accumulation (adding pixel values together), you are going to need accumulators that are smaller (adding 0s and 1s requires less space than adding values ranging from 0 to 255). An example of MNIST classification with a quantized neural network is given in the CNN advanced example.
This shows how adapting your data or model parameters can allow you to use models that may require smaller data types (i.e. use less precision) to perform their computations.
Binarization is an extreme case of quantization which is introduced here. You can also find further resources on the linked page.
While applying quantization directly to the input features is mandatory to reduce the effective bit width of computations, a different and complementary approach is dimensionality reduction. This can be accomplished through Principal Component Analysis (PCA), as shown in the Poisson Regression example.
Quantization and dimensionality reduction reduce the bit width required to run the model and increase execution speed. These two tools are necessary to make models compatible with FHE constraints.
However, quantization and, especially, binarization, induce a loss in the accuracy of the model since its representation power is diminished. Carefully choosing a quantization approach for model parameters can alleviate accuracy loss, all the while allowing compilation to FHE.
The quantization of model parameters and model inputs is illustrated in the advanced examples for Linear and Logistic Regressions. Note that different quantization parameters are used for inputs and for model weights.
Recent quantization literature usually aims to make use of dedicated machine learning accelerators in a mixed setting where a CPU or General Purpose GPU (GPGPU) is also available. Thus, in literature, some floating point computation is assumed to be acceptable. This approach allows us to reach performance similar to those achieved by floating point models. In this popular mixed float-int setting, the input is usually left in floating point. This is also true for the first and last layers, which have more impact on the resulting model accuracy than hidden layers.
However, in Concrete-ML, to respect FHE constraints, the inputs, the weights and the accumulator must all be represented with integers of a maximum of 8 bits.
Thus, in Concrete-ML, we also quantize the input data and network output activations in the same way as the rest of the network: everything is quantized to a specific number of bits. It turns out that the number of bits used for the input or the output of any activation function is crucial to comply with the constraint on accumulator width.
The core operations in neural networks are matrix multiplications (matmul) and convolutions, which both compute linear combinations of inputs (encrypted) and weights (in clear). The linear combination operation must be done such that the maximum value of its result requires at most 8 bits of precision.
Currently, Concrete-ML computes the number of bits needed for the computation depending on the inputset calibration data and does not allow the overflow to happen, raising an exception as shown previously.
The following table summarizes the various examples in this section, along with their accuracies.
Model | Dataset | Metric | Clear | Quantized | FHE |
---|---|---|---|---|---|
A * means that FHE accuracy was calculated on a subset of the validation set.
⭐️ Star the repo on Github | 🗣 Community support forum | 📁 Contribute to the project
Concrete-ML is an open-source private machine learning inference framework based on fully homomorphic encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from Scikit-learn and PyTorch (see how it looks for linear models, tree-based models and neural networks).
Fully Homomorphic Encryption (FHE) is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in this introduction, or by joining the FHE.org community.
Here is a simple example of encrypted inference using logistic regression. More examples can be found here.
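A condensed sketch of that flow (the dataset, parameters and the execute_in_fhe argument are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                              # train on plaintext data
model.compile(X_train)                                   # quantize + compile to FHE
y_pred = model.predict(X_test[:5], execute_in_fhe=True)  # encrypted inference
```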
This example shows the typical flow of a Concrete-ML model:
The model is trained on unencrypted (plaintext) data
The resulting model is quantized to small integers using either post-training quantization or quantization-aware training
The quantized model is compiled to an FHE equivalent (under the hood, the model is first converted to a Concrete-Numpy program, then compiled)
Inference can then be done on encrypted data
To make a model work with FHE, the only constraint is to make it run within the supported precision limitations of Concrete-ML (currently 8-bit integers).
Concrete-ML is built on top of Zama’s Concrete framework. It uses Concrete-Numpy, which itself uses the Concrete-Compiler and the Concrete-Library. To use these libraries directly, refer to the Concrete-Numpy and Concrete-Framework documentations.
Currently, Concrete only supports 8-bit encrypted integer arithmetic. This requires models to be quantized heavily, which sometimes leads to a loss of accuracy vs the plaintext model. Furthermore, the Concrete-Compiler is still a work in progress, meaning it won't always find optimal performance parameters, leading to slower than expected execution times.
Additionally, Concrete-ML currently only supports FHE inference. Training on the other hand has to be done on unencrypted data, producing a model which is then converted to an FHE equivalent that can do encrypted inference.
Finally, there is currently no support for pre- and post-processing in FHE. Data must arrive at the FHE model already pre-processed, and post-processing (if there is any) has to be done client-side.
All of these issues are currently being addressed and significant improvements are expected to be released in the coming months.
Concrete-ML provides several of the most popular linear models for regression
or classification
that can be found in Scikit-learn:
Concrete-ML | scikit-learn |
---|---|
Using these models in FHE is extremely similar to what can be done with scikit-learn's API, making it easy for data scientists who are used to this framework to get started with Concrete-ML.
Models are also compatible with some of scikit-learn's main workflows, such as Pipeline() or GridSearch().
Here's an example of how to use this model in FHE on a simple dataset below. A more complete example can be found in the LogisticRegression notebook.
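A short sketch of this comparison, assuming the same compile/predict API as above (parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=2)

# Reference model, trained and executed in the clear
sklearn_model = SklearnLogisticRegression().fit(X, y)

# Concrete-ML model: trained in the clear, then quantized and compiled to FHE
concrete_model = ConcreteLogisticRegression(n_bits=2)
concrete_model.fit(X, y)
concrete_model.compile(X)

print("scikit-learn accuracy:", sklearn_model.score(X, y))

# FHE inference is slow, so only a few samples are evaluated here
y_pred_fhe = concrete_model.predict(X[:10], execute_in_fhe=True)
print("FHE predictions:", y_pred_fhe)
```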
We can then plot how the model classifies the inputs and then compare those results with a scikit-learn model executed in clear. The complete code can be found in the LogisticRegression notebook.
We can clearly observe the impact of quantization on the decision boundaries in the FHE model: the initially straight boundaries become staircase-like lines. However, this does not change the overall score, as both models achieve the same accuracy (90%).
In fact, the quantization process may sometimes create some artifacts that could lead to a decrease in performance. Still, the impact of those artifacts is often minor when considering linear models, making FHE models reach similar scores as their equivalent clear ones.
Concrete-ML provides several of the most popular tree models for classification that can be found in Scikit-learn:
Concrete-ML | scikit-learn |
---|---|
In addition to our support for scikit-learn, Concrete-ML also supports XGBoost's XGBClassifier:
Concrete-ML | XGBoost |
---|---|
Here's an example of how to use this model in FHE on a popular dataset using some of scikit-learn's preprocessing tools. A more complete example can be found in the XGBClassifier notebook.
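A sketch of how this might look with a standard scikit-learn preprocessing step (dataset and parameter values are illustrative; the n_bits argument is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from concrete.ml.sklearn import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standard scikit-learn preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3-bit quantized gradient boosting model
model = XGBClassifier(n_bits=3, n_estimators=20, max_depth=3)
model.fit(X_train, y_train)
model.compile(X_train)

# Encrypted inference on a few samples
y_pred_fhe = model.predict(X_test[:5], execute_in_fhe=True)
```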
Using the above example, we can then plot how the model classifies the inputs and compare those results with the XGBoost model executed in the clear. A 6-bit model is also given in order to better understand the impact of quantization on classification. Similar plots can be found in the Classifier Comparison notebook.
This shows the impact of quantization on the decision boundaries in the FHE models, especially with the 3-bit model, where only three main decision boundaries can be observed. This results in a small decrease in accuracy of about 7% compared to the initial XGBoost classifier. Using 6 bits of quantization makes the model reach 93% accuracy, reducing this difference to only 1.7%.
In fact, the quantization process may sometimes create some artifacts that could lead to a decrease in performance. Still, the impact of those artifacts is often minor when considering small tree-based models, making FHE models reach similar scores as their equivalent clear ones.
This section provides a set of tools and guidelines to help users build optimized FHE compatible models.
The Virtual Lib in Concrete-ML is a prototype that provides drop-in replacements for Concrete-Numpy's compiler that allow users to simulate what would happen when converting a model to FHE without the current bit width constraint, as well as quickly simulating the behavior with 8 bits or less without actually doing the FHE computations.
The Virtual Lib can be useful when developing and iterating on an ML model implementation. For example, you can check that your model is compatible in terms of operands (all integers) with the Virtual Lib compilation. Then, you can check how many bits your ML model would require, which can give you hints as to how it should be modified if you want to compile it to an actual FHE Circuit (not a simulated one) that only supports 8 bits of integer precision.
The Virtual Lib, being pure Python and not requiring crypto key generation, can be much faster than the actual compilation and FHE execution, thus allowing for faster iterations, debugging and FHE simulation, regardless of the bit width used. This was, for example, used for the red/blue contours in the Classifier Comparison notebook, as computing in FHE over the whole grid and all the classifiers would take significant time.
The following example shows how to use the Virtual Lib in Concrete-ML. Simply add use_virtual_lib = True
and enable_unsafe_features = True
in a Configuration
. The result of the compilation will then be a simulated circuit that allows for more precision or simulated FHE execution.
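A sketch of how this could be set up with a built-in model (the Configuration class location and the compile keyword names are assumptions that may differ across versions):

```python
import numpy as np
from concrete.numpy import Configuration  # class name/location assumed
from concrete.ml.sklearn import LogisticRegression

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

cfg = Configuration(enable_unsafe_features=True)  # required by the Virtual Lib

model = LogisticRegression()
model.fit(X_train, y_train)

# Simulated compilation: allows checking bit widths without real FHE execution
model.compile(X_train, configuration=cfg, use_virtual_lib=True)
```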
The following example produces a neural network that is not FHE compatible:
Upon execution, the compiler will raise the following error:
Knowing that a linear/dense layer is implemented as a matrix multiplication, it can be determined which parts of the op-graph listing in the exception message above correspond to which layers:
Layer weights initialization:
Input processing and quantization:
First dense layer and activation function:
Second dense layer and activation function:
Third dense layer and output quantization:
We can see here that the error is in the second layer. Reducing the number of neurons in this layer will resolve the error and make the network FHE compatible:
In FHE, univariate functions are encoded as table lookups, which are then implemented using Programmable Bootstrapping (PBS). Programmable bootstrapping is a powerful technique, but it requires significantly more computing resources, and thus time, than simpler encrypted operations such as matrix multiplication, convolution or addition.
Furthermore, the cost of a PBS will depend on the bitwidth of the compiled circuit. Every additional bit in the maximum bitwidth raises the complexity of the PBS by a significant factor. It thus may be of interest to the model developer to determine the bitwidth of the circuit and the number of PBS it performs.
This can be done by inspecting the MLIR code produced by the compiler:
We notice that we have calls to FHELinalg.apply_mapped_lookup_table
and FHELinalg.apply_lookup_table
. These calls apply PBS to the cells of their input tensors. Their inputs in the listing above are: tensor<1x2x!FHE.eint<8>>
for the first and last call and tensor<1x50x!FHE.eint<8>>
for the two calls in the middle. Thus PBS is applied 104 times.
Getting the bitwidth of the circuit is then simply:
Decreasing the number of bits and the number of PBS induces large reductions in the computation time of the compiled circuit.
In addition to the built-in Concrete-ML models and to custom torch models, it is also possible to directly compile ONNX models. This can be particularly appealing, notably to import models trained with Keras. It can also be interesting in the context of QAT, since many ONNX models are available on the web.
ONNX models can be compiled by performing post-training quantization (PTQ) or by directly importing models that are already quantized with quantization-aware training (QAT).
The following example shows how to compile an ONNX model using post-training quantization. The model was initially trained using Keras before being exported to ONNX. The training code is not shown here.
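A sketch of what this import could look like, assuming a compile_onnx_model helper in concrete.ml.torch.compile (helper name and signature are assumptions; check the API reference of your version):

```python
import numpy as np
import onnx
from concrete.ml.torch.compile import compile_onnx_model  # helper name assumed

# Load the ONNX model exported from Keras (path is illustrative)
onnx_model = onnx.load("model.onnx")

# Representative inputset used for calibration during post-training quantization
inputset = np.random.rand(100, 10).astype(np.float32)

quantized_numpy_module = compile_onnx_model(onnx_model, inputset, n_bits=3)
```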
While Keras was used in this example, it is not officially supported, as additional work is needed to test all of Keras' types of layer and models.
QAT models contain quantizers in the ONNX graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Since these QAT models have quantizers that are configured during training to a specific number of bits, the ONNX graph will need to be imported using the same settings:
The following operators are supported for evaluation and conversion to an equivalent FHE circuit. Other operators were not implemented either due to FHE constraints, or because they are rarely used in PyTorch activations or scikit-learn models.
Abs
Acos
Acosh
Add
Asin
Asinh
Atan
Atanh
AveragePool
BatchNormalization
Cast
Celu
Clip
Constant
Conv
Cos
Cosh
Div
Elu
Equal
Erf
Exp
Flatten
Gemm
Greater
GreaterOrEqual
HardSigmoid
HardSwish
Identity
LeakyRelu
Less
LessOrEqual
Log
MatMul
Mul
Not
Or
PRelu
Pad
Pow
ReduceSum
Relu
Reshape
Round
Selu
Sigmoid
Sin
Sinh
Softplus
Sub
Tan
Tanh
ThresholdedRelu
Transpose
Where
Pruning is a method to reduce neural network complexity, usually applied in order to reduce the computation cost or memory size. Pruning is used in Concrete-ML to control the size of accumulators in neural networks, thus making them FHE compatible. See the quantization documentation for an explanation of the accumulator bitwidth constraints.
In neural networks, a neuron computes a linear combination of inputs and learned weights, then applies an activation function.
The neuron computes:
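With inputs $x_i$, weights $w_i$, bias $b$ and activation function $\phi$, this can be written as:

$$y = \phi\left(\sum_i w_i x_i + b\right)$$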
When building a full neural network, each layer will contain multiple neurons, which are connected to the neuron outputs of a previous layer or to the inputs.
Fixing some of the weights to 0 makes the network graph look more similar to the following:
In addition to the built-in models, Concrete-ML supports generic machine learning models implemented with Torch, or exported to ONNX.
The following example uses a simple torch model that implements a fully connected neural network with two hidden units. Due to its small size, making this model respect FHE constraints is relatively easy.
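For illustration, such a model could look like the following sketch (the exact architecture of the original example may differ):

```python
import torch
from torch import nn

class SimpleTorchModel(nn.Module):
    """A small fully connected network."""

    def __init__(self, input_dim=2, hidden=10, n_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, n_classes)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)

torch_model = SimpleTorchModel()
# ... standard torch training loop on plaintext data ...
```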
Once the model is trained, calling the compile_torch_model
from Concrete-ML will automatically perform post-training quantization and compilation to FHE. Here, we use a 3-bit quantization for both the weights and activations.
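Assuming the model above has been trained, the compilation step could look like this (the import path and keyword names are assumptions to be checked against the API reference):

```python
import numpy as np
from concrete.ml.torch.compile import compile_torch_model

# Representative (calibration) inputset with the same shape as the training data
inputset = np.random.rand(100, 2).astype(np.float32)

quantized_numpy_module = compile_torch_model(
    torch_model,   # the trained nn.Module from the previous snippet
    inputset,
    n_bits=3,      # 3-bit weights and activations
)
```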
The model can now be used to perform encrypted inference. Next, the test data is quantized:
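Continuing the sketch above:

```python
x_test = np.random.rand(1, 2).astype(np.float32)
q_x_test = quantized_numpy_module.quantize_input(x_test)
```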
and the encrypted inference run using either:
quantized_numpy_module.forward_and_dequant()
to compute predictions in the clear, on quantized data and then de-quantize the result. The return value of this function contains the dequantized (float) output of running the model in the clear. Calling the forward function on the clear data is useful when debugging. The results in FHE will be the same as those on clear quantized data.
quantized_numpy_module.forward_fhe.encrypt_run_decrypt()
to perform the FHE inference. In this case, de-quantization is done in a second stage using quantized_numpy_module.dequantize_output()
.
While the example above shows how to import a floating point model for post-training quantization, Concrete-ML also provides an option to import quantization aware trained (QAT) models.
Suppose that n_bits_qat
is the bitwidth of activations and weights during the QAT process. To import a torch QAT network you can use the following library function:
Concrete-ML supports a variety of torch operators that can be used to build fully connected or convolutional neural networks, with normalization and activation layers. Moreover, many element-wise operators are supported.
Note that the equivalent versions from torch.functional
are also supported.
Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).
This means that some accuracy in the representation is lost (e.g. a simple approach is to eliminate least-significant bits), but, in many cases in machine learning, it is possible to adapt the models to give meaningful results while using these smaller data types. This significantly reduces the number of bits necessary for intermediary results during the execution of these machine learning models.
Since FHE is currently limited to 8-bit integers, it is necessary to quantize models to make them compatible. As a general rule, the smaller the precision models use, the better the FHE performance.
Let $[\alpha, \beta]$ be the range of our value to quantize, where $\alpha$ is the minimum and $\beta$ is the maximum. To quantize a range with floating point values (in $\mathbb{R}$) to integer values (in $\mathbb{Z}$), we first need to choose the data type that is going to be used. Concrete, the framework used by Concrete-ML, is currently limited to 8-bit integers, so this will be the value used in this example. Knowing the number of bits that can be used for a value in the range $[\alpha, \beta]$, we can compute the scale $S$ of the quantization:

$$S = \frac{\beta - \alpha}{2^n - 1}$$

where $n$ is the number of bits ($n \leq 8$).

In practice, the scale is the gap between two consecutive representable values: every interval of length $S$ of the input range is represented by a single integer in $[0, 2^n - 1]$, which means there can be a substantial loss of precision when $S$ is large.
The other important parameter of this quantization schema is the zero point value. This essentially brings the 0 floating point value to a specific integer. If the quantization scheme is asymmetric (quantized values are not centered on 0), the resulting integer will be in $[0, 2^n - 1]$.
When using quantized values in a matrix multiplication or convolution, the equations for computing the result become more complex. The IntelLabs distiller quantization documentation provides a more detailed explanation of the mathematics used to quantize values and how to keep computations consistent.
Quantization implemented in Concrete-ML is done in two ways:
The quantization is done automatically during the model's FHE compilation process. This approach requires little work by the user, but may not be a one-size-fits-all solution for all types of models. The final quantized model is FHE friendly and ready to predict over encrypted data. This approach is done using Post-Training Quantization (PTQ).
In some cases (when doing extreme quantization), PTQ is not sufficient to achieve a decent final model accuracy. Concrete-ML offers the possibility for the user to perform quantization before compilation to FHE, for example using Quantization-Aware Training (QAT). This can be done by any means, including by using third-party frameworks. In this approach, the user is responsible for implementing a full-integer model respecting the FHE constraints.
Concrete-ML has support for quantized ML models as well as quantization tools. The core of this functionality is the conversion of floating point values to integers. This is done using QuantizedArray
in concrete.ml.quantization
.
QuantizedArray takes the following arguments:

- n_bits: defines the precision of the quantization
- values: the floating point values that will be converted to integers
- is_signed: determines if the quantized integer values should allow negative values
- is_symmetric: determines if the range of floating point values to be quantized should be taken as symmetric around zero
It is also possible to use symmetric quantization, where the integer values are centered around 0:
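A small sketch of both constructions, using the arguments listed above (the qvalues and dequant() members are the ones described later in this section):

```python
import numpy as np
from concrete.ml.quantization import QuantizedArray

values = np.array([-1.57, 0.0, 0.42, 1.8])

# Asymmetric 2-bit quantization over the range of `values`
q_arr = QuantizedArray(n_bits=2, values=values)
print(q_arr.qvalues)    # integer representation
print(q_arr.dequant())  # approximate floating point values

# Symmetric, signed 2-bit quantization centered around 0
q_sym = QuantizedArray(n_bits=2, values=values, is_signed=True, is_symmetric=True)
```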
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
The models implemented in Concrete-ML provide features to let the user quantize the input data and dequantize the output data.
Here is a simple example showing how to perform inference, starting from float values and ending up with float values. Note that the FHE engine that is compiled for the ML models does not support data batching.
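A sketch of this flow, using the function names described just below (exact signatures may vary):

```python
import numpy as np

# quantized_module: a compiled Concrete-ML model / QuantizedModule
# x_test: a 2D numpy array of floating point test samples
predictions = []
for x in x_test:
    q_x = quantized_module.quantize_input(x.reshape(1, -1))
    q_y = quantized_module.forward_fhe.encrypt_run_decrypt(q_x)
    predictions.append(quantized_module.dequantize_output(q_y))

predictions = np.concatenate(predictions)
```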
The functions quantize_input
and dequantize_output
make use of QuantizedArray
described above. When the ML model quantized_module
is calibrated, the min and max of the value distributions will be stored and applied to quantize/dequantize new data.
In the following example, QuantizedArray
is used in a different way, using pre-quantized integer values and having the scale
and zero-point
set explicitly from calibration parameters. Once the QuantizedArray
is constructed, calling dequant()
will compute the floating point values corresponding to the integer values qvalues
, which are the output of the forward_fhe.encrypt_run_decrypt(..)
call.
There are 3 types of operators:
Operators that perform linear combinations of encrypted and constant (clear) values. For example: matrix multiplication, convolution, addition
Operators that perform element-wise operations between two encrypted tensors. For example: addition
Element-wise, fixed-function operators which can be: addition with a constant, activation functions
The following example shows how to use the _prepare_inputs_with_constants
helper function with quantize_actual_values=True
to apply the quantization function to the input data of the Gemm
operator. Since the quantization function uses floats and a non-linear function (round
), a TLU will automatically be generated together with quantization.
This section includes a complete example of a neural network in Torch, as well as links to additional examples.
In this example, we will train a fully-connected neural network on a synthetic 2D dataset with a checkerboard grid pattern of 100 x 100 points. The data is split into 9500 training and 500 test samples.
This shows that the fp32 accuracy and accumulator size increase with the number of hidden neurons, while the 3-bit accuracy remains low irrespective of the number of neurons. While all the configurations tried here were FHE compatible (accumulator < 8 bits), it is sometimes preferable to have a lower accumulator size in order for the inference time to be faster.
The accumulator size is determined by Concrete-Numpy as the maximum bitwidth encountered anywhere in the encrypted circuit.
Considering that FHE only works with limited integer precision, there is a risk of overflowing in the accumulator, resulting in unpredictable results.
The following code shows how to use pruning in our previous example:
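A sketch of how this could be done with torch's pruning utilities (not necessarily the exact code of the original example; layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in for the network of the previous example
net = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 2))
hidden_layers = [net[0], net[2]]

def prune_layers(layers, amount=0.7):
    """Force a fraction of the smallest-magnitude weights of each layer to zero."""
    for layer in layers:
        prune.l1_unstructured(layer, name="weight", amount=amount)

def make_pruning_permanent(layers):
    """Remove the pruning re-parametrization so the zeros are baked into the weights."""
    for layer in layers:
        prune.remove(layer, name="weight")

prune_layers(hidden_layers, amount=0.7)
# ... train the pruned network ...
make_pruning_permanent(hidden_layers)
```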
Results with PrunedSimpleNet, a pruned version of SimpleNet with 100 neurons in the hidden layers, are given below:
This shows that the fp32 accuracy has been improved while maintaining constant mean accumulator size.
When pruning a larger neural network during training, it is easier to obtain a low-bitwidth accumulator while maintaining better final accuracy. Thus, pruning is more robust than training a similar smaller network.
The quantization-aware training (QAT) import tool in Concrete-ML is a work in progress. While it has been tested with some networks built with Brevitas, it is possible to use other tools to obtain QAT networks.
Training this network with 30 non-zero neurons out of 100 total gives good accuracy while being FHE compatible (accumulator size < 8 bits).
The torch QAT training loop is the same as the standard floating point training loop, but hyperparameters such as learning rate might need to be adjusted.
Quantization Aware Training is somewhat slower than normal training. QAT introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs, as its computational intensity is low with respect to data transfer time.
The following table summarizes the examples in this section.
In this table, ** means that the accuracy is actually random-like, because the quantization needed to fulfill the bitwidth constraints is too strong.
While this can seem like a major limitation, in practice machine learning models have features that use only a limited range of values. For example, if a feature takes a value that is limited to the range $[1, 2)$, in floating point this value is represented as $2^0 \cdot m$ where $m$ is the mantissa, a number between 1 and 2. Generic floating point representation can support exponents between -126 and 127, allocating 8 bits to store the exponent. In our case, a single exponent value of 0 is needed, so, for this range, the exponent can take only a single value out of the 253 possible ones. We can thus save the 8 bits allocated to the exponent, reducing the bit width necessary.
For example, if you quantize your inputs and weights with $n_{\mathsf{inputs}}$ and $n_{\mathsf{weights}}$ bits of precision, one can compute the maximum dimensionality of the input and weights before the matmul/convolution result could exceed the 8 bits as such:

$$\Omega = \mathsf{floor}\left(\frac{2^{n_{\mathsf{max}}} - 1}{(2^{n_{\mathsf{weights}}} - 1)(2^{n_{\mathsf{inputs}}} - 1)}\right)$$

where $n_{\mathsf{max}} = 8$ is the maximum precision allowed. For example, if we set $n_{\mathsf{weights}} = 2$ and $n_{\mathsf{inputs}} = 2$ with $n_{\mathsf{max}} = 8$, then at most $\Omega = 28$ inputs/weights are allowed in the linear combination.

Beyond $\Omega$ dimensions in the input and weights, the risk of overflow increases quickly. It may happen that, for some distributions of weights and values, the computation does not overflow, but the risk increases rapidly with the number of dimensions.
For every neuron shown in each layer of the figure above, the linear combinations of inputs and learned weights are computed. Depending on the values of the inputs and weights, the sum - which for Concrete-ML neural networks is computed with integers - can take a range of different values.
To respect the bit width constraint of the FHE circuit, the values of the accumulator must remain small to be representable with only 8 bits. In other words, the values must be between 0 and 255.
Pruning a neural network entails fixing some of the weights to be zero during training. This is advantageous to meet FHE constraints, as, irrespective of the distribution of the input values, multiplying these values by 0 does not increase the accumulator value.
While pruning weights can reduce the prediction performance of the neural network, studies show that a high level of pruning (above 50%) can often be applied. See how Concrete-ML uses pruning in its built-in fully connected neural networks.
QAT models contain quantizers in the torch graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Torch quantizers are not included in Concrete-ML, so you can either implement your own or use a third-party library such as Brevitas, as shown in the QAT examples. Custom models can have a more generic architecture and training procedure than the Concrete-ML built-in models.
-- sometimes accuracy issues
-- sometimes accuracy issues
-- partial support
The QuantizedArray class takes several arguments that determine how float values are quantized (see the API reference for more information):
Intermediary values computed during model inference might need to be re-scaled into the quantized domain of a subsequent model operator. For example, the output of a convolution layer in a neural network might have values that are 8 bits wide, with the next convolutional layer requiring that its inputs are at most 2 bits wide. In the non-encrypted realm, this re-scaling requires floating point operations. To make this work with integers as required by FHE, Concrete-ML uses a table lookup (TLU), which is implemented in FHE with programmable bootstrapping. Table lookups are expensive in FHE, and so should only be used when necessary.
The operations done by the activation function of a previous layer and the additional re-scaling to the new quantized domain, which are all floating point operations, can be fused into a single table lookup. Concrete-ML implements quantized operators that perform this fusion, significantly reducing the number of TLUs necessary to perform inference.
TLU generation for element-wise operations can be delegated to Concrete-Numpy directly by calling the function's corresponding NumPy implementation.
IntelLabs distiller explanation of quantization:
Lei Mao's blog on quantization:
Google paper on neural network quantization and integer-only inference:
This network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bit weights and activations. The mean accumulator size shown below was extracted using the Virtual Lib.
neurons | 10 | 30 | 100 |
---|---|---|---|
To understand how to overcome this limitation, consider a scenario where 2 bits are used for weights and layer inputs/outputs. The Linear layer computes a dot product between weights $w$ and inputs $x$. With 2 bits, no overflow can occur during the computation of the Linear layer as long as the number of neurons does not exceed 14, i.e. the sum of 14 products of 2-bit numbers does not exceed 7 bits.
By default, Concrete-ML uses symmetric quantization for model weights, with values in the interval $[-2^{n_{\mathsf{bits}}-1}, 2^{n_{\mathsf{bits}}-1} - 1]$. For example, for $n_{\mathsf{bits}} = 2$ the possible values are $\{-2, -1, 0, 1\}$; for $n_{\mathsf{bits}} = 3$ the values can be $\{-4, -3, -2, -1, 0, 1, 2, 3\}$.
However, in a typical setting, the weights will not all have the maximum or minimum value (e.g. $-2^{n_{\mathsf{bits}}-1}$). Instead, weights typically have a normal distribution around 0, which is one of the motivating factors for their symmetric quantization. A symmetric distribution and many zero-valued weights are desirable because opposite-sign weights can cancel each other out and zero weights do not increase the accumulator size.
This can be leveraged to train a network with more neurons, while not overflowing the accumulator, using a technique called pruning, where the developer can impose a number of zero-valued weights. Torch supports pruning out of the box.
non-zero neurons | 10 | 30 |
---|---|---|
While pruning helps maintain the post-quantization level of accuracy in low-precision settings, it does not help maintain accuracy when quantizing from floating point models. The best way to guarantee accuracy is to use quantization-aware training (read more in the quantization documentation).
In this example, QAT is done using Brevitas, changing Linear layers to QuantLinear and adding quantizers on the inputs of linear layers using QuantIdentity.
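A sketch of what such a network could look like with Brevitas (layer sizes and bit widths are illustrative; check the Brevitas documentation for exact arguments):

```python
import torch.nn as nn
import brevitas.nn as qnn

N_BITS = 3

class QATSimpleNet(nn.Module):
    def __init__(self, n_hidden=30):
        super().__init__()
        # Quantize the inputs of each linear layer
        self.quant_input = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.fc1 = qnn.QuantLinear(2, n_hidden, bias=True, weight_bit_width=N_BITS)
        self.quant_hidden = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.fc2 = qnn.QuantLinear(n_hidden, 2, bias=True, weight_bit_width=N_BITS)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.quant_input(x)
        x = self.quant_hidden(self.relu(self.fc1(x)))
        return self.fc2(x)
```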
non-zero neurons | 30 |
---|---|
Model | Dataset | Metric | Clear | Quantized | FHE |
---|---|---|---|---|---|
Linear Regression | Synthetic 1D | R2 | 0.876 | 0.863 | 0.863 |
Logistic Regression | Synthetic 2D with 2 classes | accuracy | 0.90 | 0.875 | 0.875 |
Poisson Regression | | mean Poisson deviance | 0.61 | 0.60 | 0.60 |
Gamma Regression | | mean Gamma deviance | 0.45 | 0.45 | 0.45 |
Tweedie Regression | | mean Tweedie deviance (power=1.9) | 33.42 | 34.18 | 34.18 |
Decision Tree | | precision score | 0.95 | 0.97 | 0.97* |
XGBoost | | MCC | 0.48 | 0.52 | 0.52* |
fp32 accuracy | 68.70% | 83.32% | 88.06% |
3bit accuracy | 56.44% | 55.54% | 56.50% |
mean accumulator size | 6.6 | 6.9 | 7.4 |
fp32 accuracy | 82.50% | 88.06% |
3bit accuracy | 57.74% | 57.82% |
mean accumulator size | 6.6 | 6.8 |
3bit accuracy brevitas | 94.4% |
3bit accuracy in Concrete-ML | 91.8% |
accumulator size | 7 |
Concrete-ML provides functionality to deploy FHE machine learning models in a client/server setting. The deployment workflow and model serving follows the following pattern:
The training of the model and its compilation to FHE are performed on a development machine. Three different files are created when saving the model:
- client.json: contains the secure cryptographic parameters needed for the client to generate the private and evaluation keys
- server.json: contains the compiled model. This file is sufficient to run the model on a server.
- serialized_processing.json: contains the metadata about the pre- and post-processing, such as the quantization parameters used to quantize the input and dequantize the output
The compiled model (server.zip
) is deployed to a server and the cryptographic parameters (client.zip
) along with the model meta data (serialized_processing.json
) are shared with the clients.
The client obtains the cryptographic parameters (using client.zip
) and generates a private encryption/decryption key as well as a set of public evaluation keys. The public evaluation keys are then sent to the server, while the secret key remains on the client.
The private data is then encrypted using serialized_processing.json
by the client and sent to the server. Server-side, the FHE model inference is run on the encrypted inputs using the public evaluation keys.
The encrypted result is then returned by the server to the client, which decrypts it using its private key. Finally, the client performs any necessary post-processing of the decrypted result using serialized_processing.json
.
For a complete example, see this notebook.
Concrete-ML uses Skorch to implement multi-layer, fully-connected torch neural networks in a way that is compatible with the Scikit-learn API.
This wrapper implements Torch training boilerplate code, alleviating the work that needs to be done by the user. It is possible to add hooks during the training phase, for example once an epoch is finished.
Skorch allows the user to easily create a classifier or regressor around a neural network (NN), implemented in Torch as a nn.Module
, which is used by Concrete-ML to provide a fully-connected multi-layer NN with a configurable number of layers and optional pruning (see pruning and the neural network documentation for more information).
Under the hood, Concrete-ML uses a Skorch wrapper around a single torch module, SparseQuantNeuralNetImpl
. More information can be found in the API guide.
A linear or convolutional layer of an NN will compute a linear combination of weights and inputs (also called a 'multi-sum'). For example, a linear layer will compute:

$$y_k = \sum_{i} w_{ki} x_i + b_k$$

where $y_k$ is the output of the k-th neuron in the layer. In this case, the sum is taken on a single dimension. A convolutional layer will compute, for each output position:

$$y_k = \sum_{c=1}^{C} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} w^{(k)}_{cij} \, x_{cij} + b_k$$

where $w^{(k)}$ is the k-th filter of the convolutional layer and $C$, $K_h$ and $K_w$ are the number of input channels, the kernel height and the kernel width, respectively.
Following the formulas for the resulting bit width of quantized linear combinations described here, it can be seen that the maximum dimensionality of the input and weights that can make the result exceed 8 bits is:

$$\Omega = \mathsf{floor}\left(\frac{2^{n_{\mathsf{max}}} - 1}{(2^{n_{\mathsf{weights}}} - 1)(2^{n_{\mathsf{inputs}}} - 1)}\right)$$

Here, $n_{\mathsf{max}} = 8$ is the maximum precision allowed. For example, if $n_{\mathsf{weights}} = 2$ and $n_{\mathsf{inputs}} = 2$ with $n_{\mathsf{max}} = 8$, the worst case is where all inputs and weights are equal to their maximal value $2^2 - 1 = 3$. In this case, there can be at most $\Omega = 28$ elements in the multi-sums.
In practice, the distribution of the weights of a neural network is Gaussian, with many weights either zero or having a small value. This enables exceeding the worst-case number of active neurons without risking overflow of the bitwidth. The parameter n_hidden_neurons_multiplier
is multiplied with this maximum dimensionality $\Omega$ to determine the total number of non-zero weights that should be kept in a neuron.
The pruning mechanism is already implemented in SparseQuantNeuralNetImpl
, and the user only needs to determine the parameters listed above. They can be chosen in a way that is convenient, e.g. maximizing accuracy.
Increasing n_hidden_neurons_multiplier
can lead to improved performance, as long as the compiled NN does not exceed 8 bits of accumulator bitwidth.
Hummingbird is a third party open-source library that converts machine learning models into tensor computations. Many algorithms (see supported algorithms) are converted using a specific backend (torch, torchscript, ONNX and TVM).
Concrete-ML allows the conversion of an ONNX inference to NumPy inference (note that NumPy is always the entry point to run models in FHE with Concrete ML).
Hummingbird exposes a convert
function that can be imported as follows from the hummingbird.ml
package:
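That is:

```python
from hummingbird.ml import convert
```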
This function can be used to convert a machine learning model to an ONNX model as follows:
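For instance, converting a trained scikit-learn model (the container's model attribute is an assumption that may vary across Hummingbird versions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from hummingbird.ml import convert

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

sklearn_model = LogisticRegression().fit(X_train, y_train)

# Convert to ONNX; `test_input` is used to infer the input signature
onnx_model = convert(sklearn_model, backend="onnx", test_input=X_train).model
```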
In theory, the resulting onnx_model could be used directly within Concrete-ML's get_equivalent_numpy_forward method (as long as all operators present in the ONNX model are implemented in NumPy) to obtain the NumPy inference.
In practice, there are some steps needed to clean the ONNX output and make the graph compatible with Concrete-ML, such as applying quantization where needed or deleting/replacing non-FHE friendly ONNX operators (such as Softmax and ArgMax).
Concrete-ML implements machine learning model inference using Concrete-Numpy as a backend. In order to execute in FHE, a numerical program written in Concrete-Numpy needs to be compiled. This functionality is described here, and Concrete-ML hides away most of the complexity of this step. The entire compilation process is done by Concrete-Numpy.
From the perspective of the Concrete-ML user, the compilation process performed by Concrete-Numpy can be broken up into 3 steps:
numpy program tracing and creation of a Concrete-Numpy op-graph
checking that the op-graph is FHE compatible
producing machine code for the op-graph. This step automatically determines cryptographic parameters
Additionally, the client/server API packages the result of the last step in a way that allows deploying the encrypted circuit to a server, as well as performing key generation, encryption and decryption on the client side.
The first step in the list above takes a Python function implemented using the Concrete-Numpy supported operation set and transforms it into an executable operation graph. In this step, all the floating point subgraphs in the op-graph are fused and converted to Table Lookup operations.
This makes it possible to:
execute the op-graph, which includes TLUs, on clear non-encrypted data. This is, of course, not secure, but is much faster than executing in FHE. This mode is useful for debugging. This is called the Virtual Library.
verify the maximum bitwidth of the op-graph, to determine FHE compatibility, without actually compiling the circuit to machine code. This feature is available through Concrete-Numpy and is part of the overall FHE Assistant.
The second step of compilation checks that the maximum bitwidth encountered anywhere in the circuit is valid.
If the check fails for a machine learning model, the user will need to tweak the available quantization, pruning and model hyperparameters in order to obtain FHE compatibility. The Virtual Library is useful in this setting, as described in the debugging models section.
Finally, the FHE compatible op-graph and the necessary cryptographic primitives from Concrete-Framework are converted to machine code.
Before you start this section, you must install Docker by following this official guide.
Once you have access to this repository and the dev environment is installed on your host OS (via make setup_env
once you followed the steps here), you should be able to launch the commands to build the dev Docker image with make docker_build
.
Once you do that, you can get inside the Docker environment using the following command:
After you finish your work, you can leave Docker by using the exit
command or by pressing CTRL + D
.
Documentation with GitBook is done mainly by pushing content on GitHub. GitBook then pulls the docs from the repository and publishes them. In most cases, GitBook is just a reflection of what is available in GitHub.
There are, however, some use cases where we want to modify the documentation directly in GitBook (and then push the modifications to GitHub), for example when the documentation is modified by a person outside of our organization. In this case, a GitHub branch is created and a GitBook space is associated with it: modifications are done in this space and automatically pushed to the branch. Once the modifications are done, one can simply create a pull request to finally merge the modifications into the main branch.
Documentation can alternatively be built using Sphinx:
The documentation contains both files written by hand by developers (the .md files) and files automatically created by parsing the source files.
Then, to open the documentation, go to docs/_build/html/index.html or use the following command:
To build and open the docs at the same time, use:
Concrete-ML is currently in beta, and thus may contain bugs or suboptimal APIs.
Before opening an issue or asking for support, we encourage users to read this documentation to understand common issues and limitations of Concrete-ML, as well as to check the outstanding issues on GitHub.
Furthermore, undefined behavior may occur if the inputset, which is internally used by the compilation core to set the bit widths of some intermediate data, is not sufficiently representative of the future user inputs. With all the inputs in the inputset, it appears that intermediate data can be represented as an n-bit integer, but, for a particular computation, this same intermediate data may need additional bits to be represented. The FHE execution for this computation will then result in an incorrect output, as typically occurs with integer overflows in classical programs.
If you didn't find an answer, you can ask a question on the Zama forum, or in the FHE.org discord.
When submitting an issue (here), make sure you can isolate and reproduce the bug, and give us as much information as possible. In addition to the Python script, the following information is useful:
the reproducibility rate you see on your side
any insight you might have on the bug
any workaround you have been able to find
If you would like to contribute to the project and send pull requests, take a look at the contributor guide.
There are two ways to contribute to Concrete-ML:
You can open issues to report bugs and typos and to suggest ideas.
You can ask to become an official contributor by emailing hello@zama.ai. Only approved contributors can send pull requests (PR), so please make sure to get in touch before you do!
Concrete-ML uses a consistent branch naming scheme, and you are expected to follow it as well. Here is the format, along with some examples:
e.g.
Each commit to Concrete-ML should conform to the standards of the project. You can let the development tools fix some issues automatically with the following command:
Conformance can be checked using the following command:
Your code must be well documented, containing tests and not breaking other tests:
You need to make sure you get 100% code coverage. The make pytest
command checks that by default and will fail with a coverage report at the end should some lines of your code not be executed during testing.
If your coverage is below 100%, you should write more tests and then create the pull request. If you ignore this warning and create the PR, GitHub actions will fail and your PR will not be merged.
There may be cases where covering your code is not possible (an exception that cannot be triggered in normal execution circumstances). In those cases, you may be allowed to disable coverage for some specific lines. This should be the exception rather than the rule, and reviewers will ask why some lines are not covered. If it appears they can be covered, then the PR won't be accepted in that state.
Concrete-ML uses a consistent commit naming scheme, and you are expected to follow it as well (the CI will make sure you do). The accepted format can be printed to your terminal by running:
e.g.
To learn more about conventional commits, check this page. Just a reminder that commit messages are checked in the conformance step and rejected if they don't follow the rules.
You should rebase on top of the main
branch before you create your pull request. Merge commits are not allowed, so rebasing on main
before pushing gives you the best chance of avoiding having to rewrite parts of your PR later if some conflicts arise with other PRs being merged. After you commit your changes to your new branch, you can use the following commands to rebase:
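A typical sequence (adapt the remote and branch names to your setup):

```shell
git fetch origin
git rebase origin/main
# resolve any conflicts, then update your remote branch
git push --force-with-lease
```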
You can learn more about rebasing here.
Before any final release, Concrete-ML contributors go through a release candidate (RC) cycle. The idea is that once the codebase and documentations look ready for a release, you create an RC release by opening an issue with the release template here, starting with version vX.Y.Zrc1
and then with versions vX.Y.Zrc2
, vX.Y.Zrc3
...
Once the last RC is deemed ready, open an issue with the release template using the last RC version from which you remove the rc?
part (i.e. v12.67.19
if your last RC version was v12.67.19-rc4
) on GitHub.
Concrete-ML is a Python
library, so Python
should be installed to develop Concrete-ML. v3.8
and v3.9
are the only supported versions. Concrete-ML also uses Poetry
and Make
.
First of all, you need to git clone
the project:
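For example, over https:

```shell
git clone https://github.com/zama-ai/concrete-ml.git
```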
Some tests require files tracked by git-lfs to be downloaded. To do so please follow the instructions on git-lfs website then run git lfs pull
.
A simple way to have everything installed is to use the development Docker (see the docker setup guide). On Linux and macOS you have to run the script in ./script/make_utils/setup_os_deps.sh
. Specify the --linux-install-python
flag if you want to install python3.8 as well on apt-enabled Linux distributions. The script should install everything you need for Docker and bare OS development (you can first check the content of the file to check what it will do).
For Windows users, the setup_os_deps.sh
script does not install dependencies, because of the many different installation methods available and the lack of a single package manager.
The first step is to install Python (as some of our dev tools depend on it), then Poetry. In addition to installing Python, you are still going to need the following software available on path on Windows, as some of our basic dev tools depend on them:
Development on Windows only works with the Docker environment. Follow this link to setup the Docker environment.
To manually install Python, you can follow this guide (alternatively, you can google how to install Python 3.8 (or 3.9)
).
Poetry
is used as the package manager. It drastically simplifies dependency and environment management. You can follow this official guide to install it.
As there is no concrete-compiler
package for Windows, only the dev dependencies can be installed. This requires Poetry >= 1.2.
At the time of writing (June 2022), there is only an alpha version of Poetry 1.2 that you can install. Use the official installer to install preview versions.
The dev tools use make
to launch the various commands.
On Linux, you can install make
from your distribution's preferred package manager.
On macOS, you can install a more recent version of make
via brew:
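For example:

```shell
brew install make
```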
It is possible to install gmake
as make
. Check this StackOverflow post for more info.
On Windows, check this GitHub gist.
In the following sections, be sure to use the proper make
tool for your system: make
, gmake
, or other.
To get the source code of Concrete-ML, clone the code repository using the link for your favourite communication protocol (ssh or https).
We are going to make use of virtual environments. This helps to keep the project isolated from other Python
projects in the system. The following commands will create a new virtual environment under the project directory and install dependencies to it.
The following command will not work on Windows if you don't have Poetry >= 1.2.
Finally, activate the newly created environment using the following command:
Docker automatically creates and sources a venv in ~/dev_venv/
The venv persists thanks to volumes. We also create a volume for ~/.cache to speed up later reinstallations. You can check which Docker volumes exist with:
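For example:

```shell
docker volume ls
```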
You can still run all make
commands inside Docker (to update the venv, for example). Be mindful of the current venv being used (the name in parentheses at the beginning of your command prompt).
After your work is done, you can simply run the following command to leave the environment:
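Assuming the environment was activated with source, this is simply:

```shell
deactivate
```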
From time to time, new dependencies will be added to the project or the old ones will be removed. The command below will make sure the project has the proper environment. So run it regularly!
If you are having issues, consider using the dev Docker exclusively (unless you are working on OS specific bug fixes or features).
Here are the steps you can take on your OS to try and fix issues:
At this point, you should consider using Docker as nobody will have the exact same setup as you. If, however, you need to develop on your OS directly, you can ask us for help but may not get a solution right away.
Here are the steps you can take in your Docker to try and fix issues:
If the problem persists at this point, you should ask for help. We're here and ready to assist!
Fully Connected NN | accuracy | 0.947 | 0.895 | 0.895 |
QAT Fully Connected NN | Synthetic (Checkerboard) | accuracy | 0.94 | 0.94 | 0.94 |
Convolutional NN | accuracy | 0.90 | ** | ** |
Internally, Concrete-ML uses ONNX operators as intermediate representation (or IR) for manipulating machine learning models produced through export for PyTorch, Hummingbird and skorch.
As ONNX is becoming the standard exchange format for neural networks, this allows Concrete-ML to be flexible while also making model representation manipulation quite easy. In addition, it allows for straightforward mapping to NumPy operators, supported by Concrete-Numpy, to use the Concrete stack's FHE conversion capabilities.
The diagram below gives an overview of the steps involved in the conversion of an ONNX graph to an FHE-compatible format, i.e. a format that can be compiled to FHE through Concrete-Numpy.
All Concrete-ML built-in models follow the same pattern for FHE conversion:
1. The models are trained with sklearn or torch.
2. All models have a torch implementation for inference. This implementation is provided either by a third-party tool such as Hummingbird, or implemented directly in Concrete-ML.
3. The torch model is exported to ONNX. For more information on the use of ONNX in Concrete-ML, see here.
4. The Concrete-ML ONNX parser checks that all the operations in the ONNX graph are supported and assigns reference NumPy operations to them. This step produces a NumpyModule.
5. Quantization is performed on the NumpyModule, producing a QuantizedModule. Two steps are performed: calibration and assignment of equivalent QuantizedOp objects to each ONNX operation. The QuantizedModule class is the quantized counterpart of the NumpyModule.
6. Once the QuantizedModule is built, Concrete-Numpy is used to trace the ._forward() function of the QuantizedModule.
Moreover, by passing a user-provided nn.Module to step 2 of the above process, Concrete-ML supports custom user models. See the associated FHE-friendly model documentation for instructions about working with such models.
Once an ONNX model is imported, it is converted to a NumpyModule, then to a QuantizedModule and, finally, to an FHE circuit. However, as the diagram shows, it is perfectly possible to stop at the NumpyModule level if you just want to run the torch model as NumPy code without doing quantization.
Note that if you keep the obtained NumpyModule without quantizing it with Post Training Quantization (PTQ), it will not be convertible to FHE, since the Concrete stack requires operators to use integers for computations.
The NumpyModule stores the ONNX model that it interprets. The interpreter works by going through the ONNX graph in topological order, and storing the intermediate results as it goes. To execute a node, the interpreter feeds the required inputs - taken either from the model inputs or the intermediate results - to the NumPy implementation of each ONNX node.
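As an illustration only, the interpretation loop can be sketched as follows; the function and mapping names are hypothetical and do not match the actual Concrete-ML internals (initializers and constants are also ignored for brevity):

```python
def interpret_onnx_graph(graph, numpy_impl_of, *model_inputs):
    """Hypothetical, simplified topological ONNX interpreter.

    `graph` is assumed to be an onnx GraphProto whose nodes are already in
    topological order, and `numpy_impl_of` maps an ONNX op type to a NumPy
    implementation returning a tuple of outputs.
    """
    # Intermediate results, keyed by tensor name
    results = {inp.name: value for inp, value in zip(graph.input, model_inputs)}

    for node in graph.node:
        # Gather this node's inputs from the model inputs or previous results
        inputs = [results[name] for name in node.input]
        # Execute the NumPy implementation of the ONNX operator
        outputs = numpy_impl_of[node.op_type](*inputs)
        # Store the intermediate results for downstream nodes
        for name, value in zip(node.output, outputs):
            results[name] = value

    return tuple(results[out.name] for out in graph.output)
```
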
Calibration is the process of executing the NumpyModule with a representative set of data, in floating point. It computes statistics for all the intermediate tensors used in the network, which are then used to determine the quantization parameters.
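A minimal sketch of the idea, assuming we only track per-tensor min/max statistics while running the floating point interpreter on a calibration set (the real implementation records richer statistics through its quantizer objects):

```python
import numpy

def calibrate_min_max(run_float_graph, calibration_data):
    """Hypothetical sketch: collect min/max statistics for every intermediate tensor.

    `run_float_graph` is assumed to return a dict mapping tensor names to the
    floating point values computed for one batch of inputs.
    """
    stats = {}
    for batch in calibration_data:
        for name, tensor in run_float_graph(batch).items():
            lo, hi = stats.get(name, (numpy.inf, -numpy.inf))
            stats[name] = (min(lo, tensor.min()), max(hi, tensor.max()))
    # Used afterwards to derive a scale / zero-point per tensor
    return stats
```
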
Note that the NumpyModule interpreter currently supports the following ONNX operators.
Quantization is the process of converting floating point weights, inputs and activations to integers, according to the quantization parameters computed during calibration.
Initializers (the model's trained parameters) are quantized according to n_bits and passed to the Post Training Quantization (PTQ) process.
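For uniform quantization, the parameters derived from calibration are typically a scale and a zero-point, used as in the simplified sketch below (the actual Concrete-ML quantizers handle more options, such as signed ranges and symmetric weights):

```python
import numpy

def quantize(values, v_min, v_max, n_bits):
    """Standard uniform quantization sketch: floats -> integers in [0, 2**n_bits - 1]."""
    scale = (v_max - v_min) / (2**n_bits - 1)
    zero_point = numpy.round(-v_min / scale)
    q_values = numpy.clip(numpy.round(values / scale) + zero_point, 0, 2**n_bits - 1)
    return q_values.astype(numpy.int64), scale, zero_point

def dequantize(q_values, scale, zero_point):
    """Approximate reconstruction of the float values."""
    return scale * (q_values - zero_point)
```
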
During the PTQ process, the ONNX model stored in the NumpyModule is interpreted and calibrated using the ONNX_OPS_TO_QUANTIZED_IMPL dictionary, which maps ONNX operators (e.g. Gemm) to their quantized equivalent (e.g. QuantizedGemm). For more information on implementing these operations, please see the FHE compatible op-graph section.
Quantized operators are then used to create a QuantizedModule that, similarly to the NumpyModule, runs through the operators to perform the quantized inference with integer-only operations.
That QuantizedModule is then compilable to FHE if the intermediate values conform to the 8-bit precision limit of the Concrete stack.
In order to better understand how Concrete-ML works under the hood, it is possible to access each model in its ONNX format and then either print it or visualize it by importing the associated file in Netron. For example, with LogisticRegression:
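A minimal sketch, assuming the built-in models expose the imported graph through an onnx_model attribute after fitting (the attribute name may differ across Concrete-ML versions):

```python
from sklearn.datasets import load_breast_cancer
from concrete.ml.sklearn import LogisticRegression

x, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression()
model.fit(x, y)

# Print the ONNX graph, or save it to a .onnx file and open it in Netron
print(model.onnx_model)
```
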
The ONNX import section gave an overview of the conversion of a generic ONNX graph to an FHE compatible Concrete-ML op-graph. This section describes the implementation of the operations in the Concrete-ML op-graph and the way floating point can be used in some parts of the op-graphs through table lookup operations.
Concrete, the underlying implementation of TFHE that powers Concrete-ML, enables two types of operations on integers:
arithmetic operations: addition of two encrypted values and multiplication of encrypted values with clear scalars. These are used, for example, in dot-products, matrix multiplication (linear layers), and convolution
table lookup operations (TLU): using an encrypted value as an index, return the value of a lookup table at that index. This is implemented using Programmable Bootstrapping (PBS). This operation is used to perform any non-linear computation such as activation functions, quantization, normalization
Since machine learning models use floating point inputs and weights, they first need to be converted to integer using quantization.
Alternatively, it is possible to use a table lookup to avoid the quantization of the entire graph, by converting floating-point ONNX subgraphs into lambdas, and computing their corresponding lookup tables to be evaluated directly in FHE. This operator fusion technique only requires the input and output of the lambdas to be integers.
For example, in the following graph there is a single input, which must be an encrypted integer tensor. A series of univariate functions is applied to it before it is fed into a matrix multiplication (MatMul); this series is fused into a single table lookup with integer inputs and outputs.
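As a purely illustrative NumPy sketch of this pattern (the constants and function are made up), the univariate float chain below depends only on the single integer input, so it could be fused into one TLU, while the MatMul with a constant matrix stays in the integer domain:

```python
import numpy

WEIGHTS = numpy.array([[1, -2], [3, 0], [-1, 2]])  # constant (clear) matrix

def univariate_chain(q_x):
    # Dequantize, apply a non-linearity, re-quantize: all univariate float ops,
    # so the whole chain can become a single integer -> integer table lookup
    x = (q_x - 8) * 0.25
    x = 1.0 / (1.0 + numpy.exp(-x))
    return numpy.rint(15 * x).astype(numpy.int64)

def graph(q_x):
    # Integer MatMul with constant weights follows the fused univariate chain
    return univariate_chain(q_x) @ WEIGHTS

print(graph(numpy.arange(6).reshape(2, 3)))
```
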
Concrete-ML implements ONNX operations using Concrete-Numpy, which can handle floating point operations, as long as they can be fused to an integer lookup table. The ONNX operation implementations are based on the QuantizedOp class.
There are two modes of creation of a single table lookup for a chain of ONNX operations:
float mode: when the operation can be fused
mixed float/integer: when the ONNX operation needs to perform arithmetic operations
Thus, QuantizedOp instances may need to quantize their inputs or the result of their computation, depending on their position in the graph.
The QuantizedOp class provides a generic implementation of an ONNX operation, including quantization of inputs and outputs, with the computation implemented in NumPy in ops_impl.py. The architecture of QuantizedOp can be pictured as the following structure:
This figure shows that the QuantizedOp has a body that implements the computation of the operation, following the ONNX spec. The operation's body can take either integer or float inputs and can output float or integer values. Two quantizers are attached to the operation: one that takes float inputs and produces the integer inputs, and one that does the same for the output.
Depending on the position of the op in the graph and its inputs, the QuantizedOp can be fully fused to a TLU.
Many ONNX ops are trivially univariate, as they multiply variable inputs with constants, or apply univariate functions such as ReLU, Sigmoid, etc. This includes operations between the input and the MatMul in the graph above (subtraction, comparison, multiplication, etc. between inputs and constants).
Operations such as matrix multiplication of encrypted inputs with a constant matrix, or convolution with constant weights, require that the encrypted inputs be integers. In this case, the input quantizer of the QuantizedOp is applied. These types of operation are implemented with a class that derives from QuantizedOp and implements q_impl, such as QuantizedGemm and QuantizedConv.
Finally, some operations produce graph outputs which must be integers. These operations thus need to quantize their outputs as follows:
The diagram above shows that both float ops and integer ops need to quantize their outputs to integer, when placed at the end of the graph.
To chain the operation types described above, following the ONNX graph, Concrete-ML constructs a function that calls the q_impl of the QuantizedOp instances in the graph in sequence, and uses Concrete-Numpy to trace the execution and compile to FHE. Thus, in this chain of function calls, all groups of instructions that operate in floating point will be fused to table lookups (TLUs). In FHE, this lookup table is computed with a PBS.
The red contours show the groups of elementary Concrete Numpy instructions that will be converted to TLUs.
Note that the input is slightly different from the QuantizedOp. Since the encrypted function takes integers as inputs, the input needs to be dequantized first.
QuantizedOp
QuantizedOp is the base class for all ONNX quantized operators. It abstracts away many things to allow easy implementation of new quantized ops.
The QuantizedOp class exposes a can_fuse function that:
helps to determine the type of implementation that will be traced
determines whether operations further in the graph that depend on the results of this operation can fuse
In most cases, ONNX ops have a single variable input and one or more constant inputs.
When the op implements element-wise operations between the inputs and constants (addition, subtraction, multiplication, etc.), the operation can be fused to a TLU. Thus, by default in QuantizedOp, the can_fuse function returns True.
When the op implements operations that mix the various scalars in the input encrypted tensor, the operation cannot fuse, as table lookups are univariate. Thus, operations such as QuantizedGemm and QuantizedConv return False in can_fuse.
Some operations may be found in both settings above. A mechanism is implemented in Concrete-ML to determine if the inputs of a QuantizedOp are produced by a unique integer tensor. Thus, the can_fuse function of some QuantizedOp types (addition, subtraction) will allow fusion to take place if both operands are produced by a unique integer tensor:
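A sketch of what such an override can look like, assuming the op tracks the names of the integer tensors that produce its variable inputs (the attribute name below is illustrative):

```python
def can_fuse(self) -> bool:
    # Fusion to a TLU is only possible if all variable inputs are derived from
    # a single integer tensor, i.e. the operation is effectively univariate.
    return len(self._int_input_names) == 1
```
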
You can check ops_impl.py to see how some operations are implemented with NumPy. The declaration convention for these operations is as follows (a sketch is given after the list):
The required inputs should be positional arguments only, before the /, which marks the limit of the positional arguments
The optional inputs should be positional or keyword arguments, between the / and *, which marks the limit of positional or keyword arguments
The operator attributes should be keyword arguments only, after the *
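For illustration, a hypothetical operator following this convention could be declared like this (the op and argument names are made up):

```python
from typing import Optional

import numpy

def numpy_example_op(
    x: numpy.ndarray,                        # required input: positional-only (before the /)
    /,
    bias: Optional[numpy.ndarray] = None,    # optional input: positional-or-keyword (between / and *)
    *,
    axis: int = 0,                           # operator attribute: keyword-only (after the *)
) -> tuple:
    result = x if bias is None else x + bias
    return (numpy.sum(result, axis=axis),)
```
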
The proper use of positional/keyword arguments is required to allow the QuantizedOp class to properly populate metadata automatically. It uses Python's inspect module and stores relevant information for each argument related to its positional/keyword status. This allows using the Concrete-NumPy implementation as a specification for QuantizedOp, which removes some data duplication and allows having a single source of truth for QuantizedOp and the ONNX NumPy implementations.
In that case (unless the quantized implementation requires special handling, like QuantizedGemm), you can just set _impl_for_op_named to the name of the ONNX op for which the quantized class is implemented (this uses the ONNX_OPS_TO_NUMPY_IMPL mapping in onnx_utils.py to get the correct implementation).
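For example, a purely univariate op can be declared with just this class attribute; the sketch below uses Sigmoid as an illustration and assumes a NumPy implementation is registered for it (the import path is an assumption and may vary between versions):

```python
from concrete.ml.quantization import QuantizedOp  # import path is an assumption

class QuantizedSigmoid(QuantizedOp):
    """Sigmoid sketch: no q_impl override is needed, the float NumPy
    implementation is looked up by name and can be fused into a TLU."""

    _impl_for_op_named: str = "Sigmoid"
```
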
Providing an integer implementation requires sub-classing QuantizedOp to create a new operation. This sub-class must override q_impl in order to provide an integer implementation. QuantizedGemm is an example of such a case, where quantized matrix multiplication requires proper handling of scales and zero points. The q_impl of that class reflects that.
In the body of q_impl, in order to obtain quantized integer values, you can use the _prepare_inputs_with_constants function as such:
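A sketch of this pattern, modeled on how an op like QuantizedGemm handles its inputs; the exact keyword arguments are assumptions and may differ between versions:

```python
def q_impl(self, *q_inputs, **attrs):
    # Quantize the variable inputs and the constant inputs (weights, biases)
    # with the op's input quantizers
    prepared_inputs = self._prepare_inputs_with_constants(
        *q_inputs, calibrate=False, quantize_actual_values=True
    )

    # Integer values of the first (encrypted) input and of the constant weights
    q_input = prepared_inputs[0].qvalues
    q_weights = prepared_inputs[1].qvalues
    ...
```
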
Here, prepared_inputs will contain one or more QuantizedArray instances, of which the qvalues are the quantized integers.
Once the required integer processing code is implemented, the output of the q_impl function must be returned as a single QuantizedArray. Most commonly, this is built using the dequantized results of the processing done in q_impl.
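Continuing the sketch above, the integer result is typically dequantized and wrapped into a QuantizedArray before being returned; the variable names are illustrative and additional constructor arguments (options, stats, params) may be required in practice:

```python
def q_impl(self, *q_inputs, **attrs):
    # ... integer computation producing q_result, with scale / zero_point
    # taken from the input quantizers (names here are illustrative) ...
    float_result = (q_result - zero_point) * scale  # de-quantize the integer result

    # Return a single QuantizedArray, re-quantized with the op's output parameters
    return QuantizedArray(self.n_bits, float_result, value_is_float=True)
```
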
In this case, in q_impl you can check whether the current operation can be fused by calling self.can_fuse(). You can then have both a floating point and an integer implementation; the traced execution path will depend on can_fuse():
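A toy sketch of this dispatch, not tied to the actual Concrete-ML classes, just to illustrate the two execution paths:

```python
import numpy

class ToyQuantizedAbs:
    """Toy stand-in (not the real Concrete-ML class) for an op that supports
    both a fusable float path and an integer path."""

    def __init__(self, produced_by_unique_int_tensor: bool):
        self._fusable = produced_by_unique_int_tensor

    def can_fuse(self) -> bool:
        return self._fusable

    def q_impl(self, q_x: numpy.ndarray) -> numpy.ndarray:
        if self.can_fuse():
            # Float implementation: traced as floating point operations that
            # Concrete-Numpy will fuse into a single table lookup (TLU)
            return numpy.abs(q_x.astype(numpy.float64))
        # Integer implementation: stays in the quantized integer domain
        return numpy.abs(q_x)
```
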
The APIs of the project are detailed in a dedicated API reference.