Please note that not all hardware/OS combinations are supported. Determine your platform, OS version, and Python version before referencing the table below.
Depending on your OS, Concrete-ML may be installed with Docker or with pip:
| OS | Docker | pip |
| --- | --- | --- |
| Linux | Yes | Yes |
| Windows | Yes | Not currently |
| Windows Subsystem for Linux | Yes | Yes |
| macOS (Intel) | Yes | Yes |
| macOS (Apple Silicon, i.e. M1, M2, etc.) | Yes | Not currently |
Also, only some versions of Python are supported: in the current release, these are 3.7 (Linux only), 3.8, and 3.9. Moreover, the Concrete-ML Python package requires glibc >= 2.28. On Linux, you can check your glibc version by running `ldd --version`.
Concrete-ML can be installed on Kaggle (see question on community for more details), but not on Google Colab (see question on community for more details).
Most of these limits are shared with the rest of the Concrete stack (namely Concrete-Numpy and Concrete-Compiler). Support for more platforms will be added in the future.
Installing Concrete-ML using PyPi requires a Linux-based OS or macOS running on an x86 CPU. For Apple Silicon, Docker is the only currently supported option (see below).
Installing on Windows can be done using Docker or WSL. On WSL, Concrete-ML will work as long as the package is not installed in the /mnt/c/ directory, which corresponds to the host OS filesystem.
To install Concrete-ML from PyPi, run the following:
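A minimal install command, assuming the package is published on PyPI under the name `concrete-ml` (check the project page for the exact name and any version pinning you need):

```shell
pip install concrete-ml
```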
This will automatically install all dependencies, notably Concrete-Numpy.
Concrete-ML can be installed using Docker by either pulling the latest image or a specific version:
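For instance, assuming the image is published on Docker Hub as `zamafhe/concrete-ml` (the image name and tags are assumptions to verify against the project's registry):

```shell
# Pull the latest image, or replace "latest" with a specific version tag
docker pull zamafhe/concrete-ml:latest
```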
The image can be used with Docker volumes, see the Docker documentation here.
The image can then be used via the following command:
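A sketch using the same assumed image name; the port mapping and the host directory mounted as a volume are placeholders:

```shell
docker run --rm -it -p 8888:8888 -v /host/path:/data zamafhe/concrete-ml:latest
```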
This will launch a Concrete-ML enabled Jupyter server in Docker that can be accessed directly from a browser.
Alternatively, a shell can be launched in Docker, with or without volumes:
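Again with the assumed image name and a placeholder volume path:

```shell
# Shell without a volume
docker run --rm -it --entrypoint /bin/bash zamafhe/concrete-ml:latest

# Shell with a host directory mounted as a volume
docker run --rm -it -v /host/path:/data --entrypoint /bin/bash zamafhe/concrete-ml:latest
```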
⭐️ Star the repo on Github | 🗣 Community support forum | 📁 Contribute to the project
Concrete-ML is an open-source, privacy-preserving, machine learning inference framework based on fully homomorphic encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from Scikit-learn and PyTorch (see how it looks for linear models, tree-based models, and neural networks).
Fully Homomorphic Encryption (FHE) is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in this introduction or by joining the FHE.org community.
Here is a simple example of classification on encrypted data using logistic regression. More examples can be found here.
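Below is a minimal sketch of that flow, assuming the `LogisticRegression` class from `concrete.ml.sklearn`; the `n_bits` and `execute_in_fhe` arguments follow the API referenced elsewhere in this documentation, but exact signatures may vary between releases:

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LogisticRegression

# Generate a small synthetic classification problem
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train on plaintext data; the model is quantized to integers for FHE
model = LogisticRegression(n_bits=8)
model.fit(X_train, y_train)

# Compile the quantized model to an FHE circuit, using the training set for calibration
model.compile(X_train)

# Run inference on encrypted data (much slower than the clear prediction)
y_pred_fhe = model.predict(X_test, execute_in_fhe=True)
print(f"FHE accuracy: {numpy.mean(y_pred_fhe == y_test):.2f}")
```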
This example shows the typical flow of a Concrete-ML model:
The model is trained on unencrypted (plaintext) data using scikit-learn. As FHE operates over integers, Concrete-ML quantizes the model to use only integers during inference.
The quantized model is compiled to a FHE equivalent. Under the hood, the model is first converted to a Concrete-Numpy program, then compiled.
Inference can then be done on encrypted data. The above example shows encrypted inference in the model-development phase. Alternatively, during deployment in a client/server setting, the data is encrypted by the client, processed securely by the server, and then decrypted by the client.
To make a model work with FHE, the only constraint is to make it run within the supported precision limitations of Concrete-ML (currently 16-bit integers). Thus, machine learning models are required to be quantized, which sometimes leads to a loss of accuracy versus the original model, which operates on plaintext.
Additionally, Concrete-ML currently only supports FHE inference. On the other hand, training has to be done on unencrypted data, producing a model which is then converted to a FHE equivalent that can perform encrypted inference, i.e. prediction over encrypted data.
Finally, in Concrete-ML there is currently no support for pre-processing model inputs and post-processing model outputs. These processing stages may involve text-to-numerical feature transformation, dimensionality reduction, KNN or clustering, featurization, normalization, and the mixing of results of ensemble models.
All of these issues are currently being addressed and significant improvements are expected to be released in the coming months.
Concrete-ML is built on top of Zama's Concrete framework. It uses Concrete-Numpy, which itself uses the Concrete-Compiler and the Concrete-Library. To use these libraries directly, refer to the Concrete-Numpy and Concrete-Framework documentations.
Various tutorials are available for the built-in models and for deep learning. In addition, several standalone demos for use-cases can be found in the Demos and Tutorials section.
If you have built awesome projects using Concrete-ML, feel free to let us know and we'll link to your work!
Support forum: https://community.zama.ai (we answer in less than 24 hours).
Live discussion on the FHE.org Discord server: https://discord.fhe.org (inside the #concrete channel).
Do you have a question about Zama? You can write us on Twitter or send us an email at: hello@zama.ai
Concrete-ML models can be easily deployed in a client/server setting, enabling the creation of privacy-preserving services in the cloud.
As seen in the concepts section, a Concrete-ML model, once compiled to FHE, generates machine code that performs the inference on private data. Furthermore, secret encryption keys are needed so that the user can securely encrypt their data and decrypt the inference result. An evaluation key is also needed for the server to securely process the user's encrypted data.
Keys are generated by the user once for each service they use, based on the model the service provides and its cryptographic parameters.
The overall communications protocol to enable cloud deployment of machine learning services can be summarized in the following diagram:
The steps detailed above are as follows:
The model developer deploys the compiled machine learning model to the server. This model includes the cryptographic parameters. The server is now ready to provide private inference.
The client requests the cryptographic parameters (also called "client specs"). Once it receives them from the server, the secret and evaluation keys are generated.
The client sends the evaluation key to the server. The server is now ready to accept requests from this client. The client sends their encrypted data.
The server uses the evaluation key to securely run inference on the user's data and sends back the encrypted result.
The client now decrypts the result and can send back new requests.
For more information on how to implement this basic secure inference protocol, refer to the Production Deployment section and to the client/server example.
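The following is an illustrative sketch of this protocol, assuming the `FHEModelDev`, `FHEModelClient`, and `FHEModelServer` helpers from `concrete.ml.deployment`; the directory paths are placeholders and method names may differ between versions:

```python
from concrete.ml.deployment import FHEModelClient, FHEModelDev, FHEModelServer

# 1. Model developer: save the compiled model (circuit + client specs) for deployment
FHEModelDev(path_dir="deployment", model=model).save()

# 2. Client: obtain the client specs, then generate secret and evaluation keys
client = FHEModelClient(path_dir="deployment", key_dir="keys")
client.generate_private_and_evaluation_keys()
serialized_evaluation_keys = client.get_serialized_evaluation_keys()

# 3. Client: encrypt the input and send it, along with the evaluation keys
encrypted_input = client.quantize_encrypt_serialize(X_test[:1])

# 4. Server: run inference on the encrypted data using the evaluation keys
server = FHEModelServer(path_dir="deployment")
server.load()
encrypted_result = server.run(encrypted_input, serialized_evaluation_keys)

# 5. Client: decrypt and de-quantize the result
result = client.deserialize_decrypt_dequantize(encrypted_result)
```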
This section lists several demos that apply Concrete-ML to some popular machine learning problems. They show how to build ML models that perform well under FHE constraints, and then how to perform the conversion to FHE.
Simpler tutorials that discuss only model usage and compilation are also available for the built-in models and for deep learning.
The graph above shows that, when using a sufficiently high bit-width, quantization has little impact on the decision boundaries of the Concrete-ML FHE decision tree models. As the quantization is done individually on each input feature, the impact of quantization is strongly reduced, and, thus, FHE tree-based models reach similar accuracy as their floating point equivalents. Using 6 bits for quantization makes the Concrete-ML model reach or exceed the floating point accuracy. The number of bits for quantization can be adjusted through the `n_bits` parameter.
When `n_bits` is set low, the quantization process may create artifacts that lead to a decrease in accuracy, but the FHE execution time also decreases. In this way, it is possible to adjust the accuracy/speed trade-off, and some accuracy can be recovered by increasing `n_estimators`.
The following graph shows that using 5-6 bits of quantization is usually sufficient to reach the performance of a non-quantized XGBoost model on floating point data. The metrics plotted are accuracy and F1-score on the `spambase` data-set.
Concrete-ML provides several of the most popular classification and regression tree models that can be found in scikit-learn:
In addition to support for scikit-learn, Concrete-ML also supports XGBoost's `XGBClassifier`:
Here's an example of how to use this model in FHE on a popular data-set using some of scikit-learn's pre-processing tools. A more complete example can be found in the .
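An illustrative sketch, assuming the `XGBClassifier` class from `concrete.ml.sklearn` and using the breast-cancer data-set as the "popular data-set"; the hyper-parameters (`n_bits`, `n_estimators`, `max_depth`) are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from concrete.ml.sklearn import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scikit-learn pre-processing: standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the quantized tree-ensemble, then compile it to an FHE circuit
model = XGBClassifier(n_bits=6, n_estimators=50, max_depth=4)
model.fit(X_train, y_train)
model.compile(X_train)

# Encrypted inference on a few test samples (FHE execution is slow)
y_pred_fhe = model.predict(X_test[:10], execute_in_fhe=True)
```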
In a similar example, the decision boundaries of the Concrete-ML model can be plotted, and, then, compared to the results of the classical XGBoost model executed in the clear. A 6-bit model is shown in order to illustrate the impact of quantization on classification. Similar plots can be found in the .
Concrete-ML is built on top of Concrete-Numpy, which enables Numpy programs to be converted into FHE circuits.
training: A model is trained using plaintext, non-encrypted, training data.
inference: The compiled model can then be executed on encrypted data, once the proper keys have been generated. The model can also be deployed to a server and used to run private inference on encrypted inputs.
client/server deployment: In a client/server setting, the model can be exported in a way that:
allows the client to generate keys, encrypt, and decrypt.
provides a compiled model that can run on the server to perform inference on encrypted data.
key generation: The data owner (client) needs to generate a pair of private keys (to encrypt/decrypt their data and results) and a public evaluation key (for the model's FHE evaluation on the server).
Concrete-ML and Concrete-Numpy are tools that hide away the details of the underlying cryptography scheme, called TFHE. However, some cryptography concepts are still useful when using these two toolkits:
encryption/decryption: These operations transform plaintext, i.e. human-readable information, into ciphertext, i.e. data that contains a form of the original plaintext that is unreadable by a human or computer without the proper key to decrypt it. Encryption takes plaintext and an encryption key and produces ciphertext, while decryption is the inverse operation.
encrypted inference: FHE allows a third party to execute (i.e. run inference or predict) a machine learning model on encrypted data (a ciphertext). The result of the inference is also encrypted and can only be read by the person who receives the decryption key.
keys: A key is a series of bits used within an encryption algorithm for encrypting data so that the corresponding ciphertext appears random.
key generation: Cryptographic keys need to be generated using random number generators. Their size may be large and key generation may take a long time. However, keys only need to be generated once for each model used by a client.
guaranteed correctness of encrypted computations: To achieve security, TFHE, the underlying encryption scheme, adds random noise to ciphertexts. This can induce errors during processing of encrypted data, depending on noise parameters. By default, Concrete-ML uses parameters that ensure the correctness of the encrypted computation, so there is no need to account for noise parametrization. Therefore, the results on encrypted data will be the same as the results of simulation on clear data.
To respect FHE constraints, all numerical programs that include non-linear operations over encrypted data must have all inputs, constants, and intermediate values represented with integers of a maximum of 16 bits.
quantization: The model is converted into an integer equivalent using quantization. Concrete-ML performs this step either during training (Quantization Aware Training) or after training (Post-training Quantization), depending on model type. Quantization converts inputs, model weights, and all intermediate values of the inference computation to integers. More information is available .
simulation using the Virtual Library: Testing FHE models on very large data-sets can take a long time. Furthermore, not all models are compatible with FHE constraints out-of-the-box. Simulation using the Virtual Library allows you to execute a model that was quantized, to measure the accuracy it would have in FHE, but also to determine the modifications required to make it FHE compatible. Simulation is described in more detail .
compilation: Once the model is quantized, simulation can confirm it has good accuracy in FHE. The model then needs to be compiled using Concrete's FHE compiler to produce an equivalent FHE circuit. This circuit is represented as an MLIR program consisting of low level cryptographic operations. You can read more about FHE compilation , MLIR , and about the low-level Concrete library .
You can see some examples of the model development workflow .
You can see an example of the model deployment workflow .
While Concrete-ML users only need to understand the cryptography concepts above, for a deeper understanding of the cryptography behind the Concrete stack, please see the or .
Thus, Concrete-ML quantizes the input data and model outputs in the same way as weights and activations. The main levers to control accumulator bit-width are the number of bits used for the inputs, weights, and activations of the model. These parameters are crucial to comply with the constraint on accumulator bit-widths. Please refer to for more details about how to develop models with quantization in Concrete-ML.
However, these methods may cause a reduction in the accuracy of the model since its representative power is diminished. Most importantly, carefully choosing a quantization approach can alleviate accuracy loss, all the while allowing compilation to FHE. Concrete-ML offers built-in models that already include quantization algorithms, and users only need to configure some of their parameters, such as the number of bits, discussed above. See for information about configuring these parameters for various models.
Additional specific methods can help to make models compatible with FHE constraints. For instance, dimensionality reduction can reduce the number of input features and, thus, the maximum accumulator bit-width reached within a circuit. Similarly, sparsity-inducing training methods, such as pruning, deactivate some features during inference, which also helps. For now, dimensionality reduction is considered as a pre-processing step, while pruning is used in the .
The configuration of model quantization parameters is illustrated in the advanced examples for and dimensionality reduction is shown in the .
In addition to Concrete-ML models and custom models in torch, it is also possible to directly compile ONNX models. This can be particularly appealing, notably to import models trained with Keras.
ONNX models can be compiled by directly importing models that are already quantized with Quantization Aware Training (QAT) or by performing Post-Training Quantization (PTQ) with Concrete-ML.
The following example shows how to compile an ONNX model using PTQ. The model was initially trained using Keras before being exported to ONNX. The training code is not shown here.
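A minimal sketch of such an import, assuming the `compile_onnx_model` helper from `concrete.ml.torch.compile`; the file name, input shape, and `n_bits` value are placeholders:

```python
import numpy
import onnx
from concrete.ml.torch.compile import compile_onnx_model

# Load the ONNX model exported from Keras (hypothetical file name)
onnx_model = onnx.load("keras_model.onnx")

# Representative calibration data, matching the model's input shape
calibration_data = numpy.random.uniform(-1, 1, size=(100, 10)).astype(numpy.float32)

# Post-Training Quantization and compilation to an FHE circuit
quantized_module = compile_onnx_model(onnx_model, calibration_data, n_bits=3)
```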
This example uses Post-Training Quantization, i.e. the quantization is not performed during training. Thus, this model would not have good performance in FHE. Quantization Aware Training should be added by the model developer. Additionally, importing QAT ONNX models can be done as shown below.
While Keras was used in this example, it is not officially supported as additional work is needed to test all of Keras' types of layer and models.
QAT models contain quantizers in the ONNX graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Since these QAT models have quantizers that are configured during training to a specific number of bits, the ONNX graph will need to be imported using the same settings:
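For example, assuming the network was trained with 3-bit quantizers, the import could look like the following sketch (again using the assumed `compile_onnx_model` helper):

```python
# import_qat=True tells Concrete-ML to reuse the quantizers present in the ONNX graph,
# with the same bit-width that was used during training
quantized_module = compile_onnx_model(
    qat_onnx_model,
    calibration_data,
    import_qat=True,
    n_bits=3,
)
```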
The following operators are supported for evaluation and conversion to an equivalent FHE circuit. Other operators were not implemented, either due to FHE constraints or because they are rarely used in PyTorch activations or scikit-learn models.
Abs
Acos
Acosh
Add
Asin
Asinh
Atan
Atanh
AveragePool
BatchNormalization
Cast
Celu
Clip
Concat
Constant
Conv
Cos
Cosh
Div
Elu
Equal
Erf
Exp
Flatten
Floor
Gemm
Greater
GreaterOrEqual
HardSigmoid
HardSwish
Identity
LeakyRelu
Less
LessOrEqual
Log
MatMul
Max
MaxPool
Min
Mul
Neg
Not
Or
PRelu
Pad
Pow
ReduceSum
Relu
Reshape
Round
Selu
Sigmoid
Sign
Sin
Sinh
Softplus
Sub
Tan
Tanh
ThresholdedRelu
Transpose
Unsqueeze
Where
onnx.brevitas.Quant
The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy.
The model can now be used to perform encrypted inference. Next, the test data is quantized:
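As a sketch, assuming the compiled module exposes a `quantize_input()` helper (the exact method name may differ between versions):

```python
# Quantize the test inputs with the same parameters used during calibration
x_test_q = quantized_numpy_module.quantize_input(x_test)
```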
and the encrypted inference can be run using either:
`quantized_numpy_module.forward_and_dequant()` to compute predictions in the clear on quantized data, and then de-quantize the result. The return value of this function contains the dequantized (float) output of running the model in the clear. Calling the forward function on the clear data is useful when debugging. The results in FHE will be the same as those on clear quantized data.
`quantized_numpy_module.forward_fhe.encrypt_run_decrypt()` to perform the FHE inference. In this case, de-quantization is done in a second stage using `quantized_numpy_module.dequantize_output()`.
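Putting the two options together, a short sketch using the functions named above:

```python
# Option 1: clear execution on quantized data, with de-quantized (float) output
y_pred_clear = quantized_numpy_module.forward_and_dequant(x_test_q)

# Option 2: actual FHE execution, followed by explicit de-quantization
y_pred_fhe_q = quantized_numpy_module.forward_fhe.encrypt_run_decrypt(x_test_q)
y_pred_fhe = quantized_numpy_module.dequantize_output(y_pred_fhe_q)
```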
While the example above shows how to import a Brevitas/PyTorch model, Concrete-ML also provides an option to import generic QAT models implemented either in PyTorch or through ONNX. Deep learning models made with TensorFlow or Keras should be usable by first converting them to ONNX.
QAT models contain quantizers in the PyTorch graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized.
When importing QAT models using this generic pipeline, a representative calibration set should be given as quantization parameters in the model need to be inferred from the statistics of the values encountered during inference.
Concrete-ML supports a variety of PyTorch operators that can be used to build fully connected or convolutional neural networks, with normalization and activation layers. Moreover, many element-wise operators are supported.
Please note that Concrete-ML supports these operators but also the QAT equivalents from Brevitas.
brevitas.nn.QuantLinear
brevitas.nn.QuantConv2d
brevitas.nn.QuantIdentity
Note that the equivalent versions from `torch.functional` are also supported.
In Concrete-ML, built-in linear models are exact equivalents to their scikit-learn counterparts. Indeed, since they do not apply any non-linearity during inference, these models are very fast (~1ms FHE inference time) and can use high precision integers (between 20-25 bits).
Tree-based models apply non-linear functions that enable comparisons of inputs and trained thresholds. Thus, they are limited with respect to the number of bits used to represent the inputs. But as these examples show, in practice 5-6 bits are sufficient to exactly reproduce the behavior of their scikit-learn counterpart models.
As shown in the examples below, built-in neural networks can be configured to work with user-specified accumulator sizes, which allow the user to adjust the speed/accuracy tradeoff.
These examples show how to use the built-in linear models on synthetic data, which allows for easy visualization of the decision boundaries or trend lines. Executing these 1D and 2D models in FHE takes around 1 millisecond.
Based on three different synthetic data-sets, all the built-in classifiers are demonstrated in this notebook, showing accuracies, inference times, accumulator bit-widths, and decision boundaries.
Pruning is used in Concrete-ML for two types of neural networks:
In neural networks, a neuron computes a linear combination of inputs and learned weights, then applies an activation function.
The neuron computes: $y_k = \phi\left(\sum_i w_i x_i\right)$, where $w_i$ are the learned weights, $x_i$ the inputs, and $\phi$ the activation function.
When building a full neural network, each layer will contain multiple neurons, which are connected to the inputs or to the neuron outputs of a previous layer.
Fixing some of the weights to 0 makes the network graph look more similar to the following:
In the formula above, in the worst case, the maximum number of input/weight products that can be accumulated without exceeding $n_{\mathsf{max}}$ bits is given by:

$$\Omega = \mathsf{floor}\left(\frac{2^{n_{\mathsf{max}}} - 1}{(2^{n_{\mathsf{weights}}} - 1)(2^{n_{\mathsf{inputs}}} - 1)}\right)$$
Concrete-ML provides simple built-in neural network models with a scikit-learn interface through the `NeuralNetClassifier` and `NeuralNetRegressor` classes.
The Concrete-ML models are multi-layer, fully-connected networks with customizable activation functions and a number of neurons in each layer. This approach is similar to what is available in scikit-learn using the `MLPClassifier`/`MLPRegressor` classes. The built-in models train easily with a single call to `.fit()`, which will automatically quantize the weights and activations. These models use Quantization Aware Training, allowing good performance for low precision (down to 2-3 bit) weights and activations.
To create an instance of a Fully Connected Neural Network (FCNN), you need to instantiate one of the `NeuralNetClassifier` and `NeuralNetRegressor` classes and configure a number of parameters that are passed to their constructor. Note that some parameters need to be prefixed by `module__`, while others don't. Basically, the parameters that are related to the model, i.e. the underlying `nn.Module`, must have the prefix. The parameters that are related to training options do not require the prefix.
The figure above shows, on the right, the Concrete-ML neural network, trained with Quantization Aware Training, in a FHE-compatible configuration. The figure compares this network to the floating-point equivalent, trained with scikit-learn.
`module__n_layers`: number of layers in the FCNN, must be at least 1. Note that this is the total number of layers. For a single, hidden layer NN model, set `module__n_layers=2`
`module__n_outputs`: number of outputs (classes or targets)
`module__input_dim`: dimensionality of the input
`n_w_bits` (default 3): number of bits for weights
`n_a_bits` (default 3): number of bits for activations and inputs
`max_epochs`: The number of epochs to train the network (default 10)
`verbose`: Whether to log loss/metrics during training (default: False)
`lr`: Learning rate (default 0.001)
When you have training data in the form of a NumPy array, and targets in a NumPy 1D array, you can set:
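A hedged sketch of such a configuration, using the parameter names listed above; the synthetic data and chosen values are illustrative, and some parameters may be optional or differ by version:

```python
import numpy
import torch.nn as nn
from concrete.ml.sklearn import NeuralNetClassifier

# Illustrative training data: 1000 samples, 10 features, 2 classes
X_train = numpy.random.uniform(-1, 1, size=(1000, 10)).astype(numpy.float32)
y_train = (X_train.sum(axis=1) > 0).astype(numpy.int64)

params = {
    "module__n_layers": 2,
    "module__n_outputs": 2,
    "module__input_dim": 10,
    "module__activation_function": nn.ReLU,
    "n_w_bits": 3,
    "n_a_bits": 3,
    "max_epochs": 10,
    "verbose": True,
    "lr": 0.001,
}

model = NeuralNetClassifier(**params)
model.fit(X_train, y_train)
```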
You can give weights to each class to use in training. Note that this must be supported by the underlying PyTorch loss function.
The `n_hidden_neurons_multiplier` parameter influences training accuracy as it controls the number of non-zero neurons that are allowed in each layer. Increasing `n_hidden_neurons_multiplier` improves accuracy, but should take into account precision limitations to avoid overflow in the accumulator. The default value is a good compromise that avoids overflow in most cases, but you may want to change the value of this parameter to reduce the breadth of the network if you have overflow errors. A value of 1 should be completely safe with respect to overflow.
Some examples constrain accumulators to 7-8 bits, which can be sufficient for simple data-sets. Up to 16-bit accumulators can be used, but this introduces a slowdown of 4-5x compared to 8-bit accumulators.
Shows how to use Quantization Aware Training and pruning when starting out from a classical PyTorch network. This example uses a simple data-set and a small NN, which achieves good accuracy with low accumulator size.
Compilation of a model produces machine code that executes the model on encrypted data. In some cases, notably in the client/server setting, the compilation can be done by the server when loading the model for serving.
As FHE execution is much slower than execution on non-encrypted data, Concrete-ML has a simulation mode, using an execution mode named the Virtual Library. Since, by default, the cryptographic parameters are chosen such that the results obtained in FHE are the same as those on clear data, the Virtual Library allows you to benchmark models quickly during development.
From the perspective of the Concrete-ML user, the compilation process performed by Concrete-Numpy can be broken up into 3 steps:
tracing the Numpy program and creating a Concrete-Numpy op-graph
checking the op-graph for FHE compatibility
producing machine code for the op-graph (this step automatically determines cryptographic parameters)
The result of this single step of the compilation pipeline allows the:
verification of the maximum bit-width of the op-graph, to determine FHE compatibility, without actually compiling the circuit to machine code.
Enabling Virtual Library execution requires the definition of a compilation `Configuration`. As simulation does not execute in FHE, this can be considered unsafe:
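A minimal sketch, assuming the `Configuration` class is importable from Concrete-Numpy's compilation module (the exact import path varies between Concrete-Numpy versions):

```python
from concrete.numpy.compilation import Configuration

# "Unsafe" here only means that simulation does not provide FHE security guarantees
simulation_config = Configuration(enable_unsafe_features=True)
```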
Next, the following code uses the simulation mode for built-in models:
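For example, for a built-in model `clf` (the `configuration` and `use_virtual_lib` argument names follow the flags mentioned in this section, but may differ between releases):

```python
# Compile for simulation instead of real FHE execution
clf.compile(X_train, configuration=simulation_config, use_virtual_lib=True)
```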
And finally, for custom models, it is possible to enable simulation using the following syntax:
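And for a custom model, a hedged sketch using the assumed `compile_torch_model` helper:

```python
from concrete.ml.torch.compile import compile_torch_model

quantized_numpy_module = compile_torch_model(
    torch_model,
    X_train,
    n_bits=3,
    use_virtual_lib=True,
    configuration=simulation_config,
)
```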
Obtaining the simulated predictions of the models using the Virtual Library has the same syntax as execution in FHE:
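For instance, for a built-in model compiled with the Virtual Library as above:

```python
# With a Virtual Library circuit, this call simulates FHE execution on clear data
y_pred_simulated = clf.predict(X_test, execute_in_fhe=True)
```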
Moreover, the maximum accumulator bit-width is determined as follows:
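As a hedged sketch, assuming the compiled circuit is reachable through the model's `fhe_circuit` attribute and exposes Concrete-Numpy's graph helper (attribute names may differ between versions):

```python
max_bit_width = clf.fhe_circuit.graph.maximum_integer_bit_width()
print(f"Maximum accumulator bit-width: {max_bit_width}")
```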
This section provides a set of tools and guidelines to help users build optimized FHE-compatible models.
The Virtual Lib can be useful when developing and iterating on an ML model implementation. For example, you can check that your model is compatible in terms of operands (all integers) with the Virtual Lib compilation. Then, you can check how many bits your ML model would require, which can give you hints about ways it could be modified to compile it to an actual FHE Circuit. As FHE non-linear models work with integers up to 16 bits, with a tradeoff between number of bits and FHE execution speed, the Virtual Lib can help to find the optimal model design.
The following example shows how to use the Virtual Lib in Concrete-ML. Simply add `use_virtual_lib = True` and `enable_unsafe_features = True` in a `Configuration`. The result of the compilation will then be a simulated circuit that allows for more precision or simulated FHE execution.
The following example produces a neural network that is not FHE-compatible:
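A hedged sketch of such a network; the layer sizes and `n_bits` value are assumptions chosen so that an accumulator is likely to exceed the 16-bit limit:

```python
import torch
from torch import nn
from concrete.ml.torch.compile import compile_torch_model

class WideNet(nn.Module):
    """Hidden layers that are too wide for the 16-bit accumulator constraint."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(2, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

inputset = torch.rand(100, 2)

# With post-training quantization at this bit-width, compilation may fail with an
# error reporting that an intermediate value exceeds the 16-bit limit
quantized_module = compile_torch_model(WideNet(), inputset, n_bits=7)
```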
Upon execution, the compiler will raise the following error within the graph representation:
Knowing that a linear/dense layer is implemented as a matrix multiplication, one can determine which parts of the op-graph listing in the exception message above correspond to which layers.
Layer weights initialization:
Input data:
First dense layer and activation function:
Second dense layer and activation function:
Third dense layer:
We can see here that the error is in the second layer because the product has exceeded the 16-bit precision limit. This error is only detected when the PBS operations are actually applied.
However, reducing the number of neurons in this layer resolves the error and makes the network FHE-compatible:
In FHE, univariate functions are encoded as table lookups, which are then implemented using Programmable Bootstrapping (PBS). PBS is a powerful technique but will require significantly more computing resources, and thus time, than simpler encrypted operations such as matrix multiplications, convolution, or additions.
Furthermore, the cost of PBS will depend on the bit-width of the compiled circuit. Every additional bit in the maximum bit-width raises the complexity of the PBS by a significant factor. It may be of interest to the model developer, then, to determine the bit-width of the circuit and the amount of PBS it performs.
This can be done by inspecting the MLIR code produced by the compiler:
There are several calls to `FHELinalg.apply_mapped_lookup_table` and `FHELinalg.apply_lookup_table`. These calls apply PBS to the cells of their input tensors. Their inputs in the listing above are: `tensor<1x2x!FHE.eint<8>>` for the first and last call and `tensor<1x50x!FHE.eint<8>>` for the two calls in the middle. Thus, PBS is applied 104 times.
Retrieving the bit-width of the circuit is then simply:
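As a sketch, assuming the compiled circuit (`forward_fhe`, as used earlier in this documentation) exposes Concrete-Numpy's graph helper:

```python
print(quantized_numpy_module.forward_fhe.graph.maximum_integer_bit_width())
```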
Decreasing the number of bits and the number of PBS applications induces large reductions in the computation time of the compiled circuit.
Models are also compatible with some of scikit-learn's main workflows, such as `Pipeline()` and `GridSearch()`.
The `n_bits` parameter controls the bit-width of the inputs and weights of the linear models. When a non-linear mapping is applied by the model, such as exp or sigmoid, Concrete-ML currently applies it on the client side, on clear-text values that are the decrypted output of the linear part of the model. Thus, linear models do not use table lookups and can therefore use high-precision integers for weights and inputs. The `n_bits` parameter can be set to `8` or more bits for models with up to 300 input dimensions. When the input has more dimensions, `n_bits` must be reduced to `6-7`. Accuracy and R2 scores are preserved down to `n_bits=6`, compared to the non-quantized float models from scikit-learn.
The overall accuracy scores are identical (93%) between the scikit-learn model (executed in the clear) and the Concrete-ML one (executed in FHE). In fact, quantization has little impact on the decision boundaries, as linear models are able to consider large precision numbers when quantizing inputs and weights in Concrete-ML. Additionally, as the linear models do not use PBS, the FHE computations are always exact, meaning the FHE predictions are always identical to the quantized clear ones.
This guide provides a complete example of converting a PyTorch neural network into its FHE-friendly, quantized counterpart. It focuses on Quantization Aware Training a simple network on a synthetic data-set.
In general, quantization can be carried out in two different ways: either during training with Quantization Aware Training (QAT) or after the training phase with Post-Training Quantization (PTQ).
In PyTorch, using standard layers, a fully connected neural network would look as follows:
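For instance, a sketch of such a network; the layer sizes are illustrative, and the 2-dimensional input matches the synthetic data-set used later in this guide:

```python
import torch
from torch import nn

class SimpleNet(nn.Module):
    """Plain floating-point fully connected network with two hidden layers."""

    def __init__(self, n_hidden=100):
        super().__init__()
        self.fc1 = nn.Linear(2, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.fc3 = nn.Linear(n_hidden, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
```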
This shows that the fp32 accuracy and accumulator size increase with the number of hidden neurons, while the 3-bit accuracy remains low irrespective of the number of neurons. While all the configurations tried here were FHE-compatible (accumulator < 16 bits), it is often preferable to have a lower accumulator size in order to speed up inference.
The accumulator size is determined by Concrete-Numpy as being the maximum bit-width encountered anywhere in the encrypted circuit.
Brevitas provides a quantized version of almost all PyTorch layers (`Linear` layer becomes `QuantLinear`, `ReLU` layer becomes `QuantReLU` and so on), plus some extra quantization parameters, such as:
`bit_width`: precision quantization bits for activations
`act_quant`: quantization protocol for the activations
`weight_bit_width`: precision quantization bits for weights
`weight_quant`: quantization protocol for the weights
In order to use FHE, the network must be quantized from end to end. Thanks to Brevitas's `QuantIdentity` layer, it is possible to quantize the input by placing one at the entry point of the network. Moreover, it is also possible to combine PyTorch and Brevitas layers, provided that a `QuantIdentity` is placed after such a PyTorch layer. The following table gives the replacements to be made to convert a PyTorch NN for Concrete-ML compatibility.
Furthermore, some PyTorch operators (from the PyTorch functional API) require a `brevitas.quant.QuantIdentity` to be applied on their inputs.
The QAT import tool in Concrete-ML is a work in progress. While it has been tested with some networks built with Brevitas, it is possible to use other tools to obtain QAT networks.
For instance, with Brevitas, the network above becomes:
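A hedged sketch of the Brevitas version; the quantizer arguments follow common Brevitas usage and the bias settings quoted below, but are assumptions rather than the original code:

```python
import torch
from torch import nn
import brevitas.nn as qnn

N_BITS = 3  # assumption: same bit-width for weights and activations

class QuantSimpleNet(nn.Module):
    """Quantization Aware Training version of the network above."""

    def __init__(self, n_hidden=100):
        super().__init__()
        self.quant_input = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.fc1 = qnn.QuantLinear(2, n_hidden, bias=True, bias_quant=None, weight_bit_width=N_BITS)
        self.relu1 = qnn.QuantReLU(bit_width=N_BITS, return_quant_tensor=True)
        self.fc2 = qnn.QuantLinear(n_hidden, n_hidden, bias=True, bias_quant=None, weight_bit_width=N_BITS)
        self.relu2 = qnn.QuantReLU(bit_width=N_BITS, return_quant_tensor=True)
        self.fc3 = qnn.QuantLinear(n_hidden, 2, bias=True, bias_quant=None, weight_bit_width=N_BITS)

    def forward(self, x):
        x = self.quant_input(x)
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        return self.fc3(x)
```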
Note that in the network above, biases are used for linear layers but are not quantized (`"bias": True, "bias_quant": None`). The addition of the bias is a univariate operation and is fused into the activation function.
Training this network with pruning (see below) with 30 out of 100 total non-zero neurons gives good accuracy while keeping the accumulator size low.
The PyTorch QAT training loop is the same as the standard floating point training loop, but hyper-parameters such as learning rate might need to be adjusted.
Quantization Aware Training is somewhat slower than normal training. QAT introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs as its computational intensity is low with respect to data transfer time.
Considering that FHE only works with limited integer precision, there is a risk of overflowing in the accumulator, which will make Concrete-ML raise an error.
The following code shows how to use pruning in the previous example:
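An illustrative sketch using PyTorch's built-in pruning utilities on the hypothetical `QuantSimpleNet` above; the pruning amount is an assumption corresponding roughly to keeping 30 out of 100 neurons' worth of weights:

```python
import torch.nn.utils.prune as prune

def toggle_pruning(model, enable=True, amount=0.7):
    """Apply (or make permanent) L1-unstructured pruning on the hidden layers."""
    for layer in (model.fc1, model.fc2):
        if enable:
            # Zero out the smallest-magnitude weights
            prune.l1_unstructured(layer, name="weight", amount=amount)
        else:
            # Remove the re-parametrization, keeping the zeroed weights permanently
            prune.remove(layer, name="weight")

quant_net = QuantSimpleNet(n_hidden=100)
toggle_pruning(quant_net, enable=True)   # before/during training
# ... train the network ...
toggle_pruning(quant_net, enable=False)  # before export or compilation to FHE
```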
Results with `PrunedQuantNet`, a pruned version of the `QuantSimpleNet` with 100 neurons on the hidden layers, are given below, showing a mean accumulator size measured over 10 runs of the experiment:
This shows that the fp32 accuracy has been improved while maintaining constant mean accumulator size.
When pruning a larger neural network during training, it is easier to obtain a low bit-width accumulator while maintaining better final accuracy. Thus, pruning is more robust than training a similar, smaller network.
In addition to the built-in models, Concrete-ML supports generic machine learning models implemented with Torch, or exported as ONNX graphs.
As Quantization Aware Training (QAT) is the most appropriate method of training neural networks that are compatible with FHE constraints, Concrete-ML works with Brevitas, a library providing QAT support for PyTorch.
Once the model is trained, calling the compilation function from Concrete-ML will automatically perform conversion and compilation of a QAT network. Here, 3-bit quantization is used for both the weights and activations.
Suppose that `n_bits_qat` is the bit-width of activations and weights during the QAT process. To import a PyTorch QAT network, you can use the library's compile function, passing `import_qat=True`:
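A minimal sketch, assuming the `compile_torch_model` helper from `concrete.ml.torch.compile` is the compile function in question:

```python
from concrete.ml.torch.compile import compile_torch_model

n_bits_qat = 3

quantized_numpy_module = compile_torch_model(
    torch_model,      # the trained QAT network
    torch_input_set,  # representative calibration inputs
    import_qat=True,
    n_bits=n_bits_qat,
)
```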
Alternatively, if you want to import an ONNX model directly, please see the ONNX compilation section above. The ONNX import function also supports the `import_qat` parameter.
These examples illustrate the basic usage of built-in Concrete-ML models. For more examples showing how to train high-accuracy models on more complex data-sets, see the Demos and Tutorials section.
It is recommended to use simulation (the Virtual Library) to configure the speed/accuracy trade-off for tree-based models and neural networks, using grid-search or your own heuristics.
These two examples show generalized linear models (GLM) on a real-world data-set. As the non-linear, inverse-link functions are computed on the client side, these models do not use PBS, and are, thus, very fast (~1ms execution time).
This example shows how to train a classifier that detects spam, based on features extracted from email messages. A grid-search is performed over decision-tree hyper-parameters to find the best ones.
This example shows how to train tree-ensemble models (either XGBoost or Random Forest), first on a synthetic data-set, and then on a real-world data-set. Grid-search is used to find the best number of trees in the ensemble.
Privacy-preserving prediction of house prices is shown in this example, using a house-prices data-set. Using 50 trees in the ensemble, with 5 bits of precision for the input features, the FHE regressor obtains an R2 score of 0.90 and an execution time of 7-8 seconds.
Two different configurations of the built-in, fully-connected neural networks are shown. First, a small bit-width accumulator network is trained on and compared to a Pytorch floating point network. Second, a larger accumulator (>8 bits) is demonstrated on .
Pruning is a method to reduce neural network complexity, usually applied in order to reduce the computation cost or memory size. Pruning is used in Concrete-ML to control the size of accumulators in neural networks, thus making them FHE-compatible. See for an explanation of accumulator bit-width constraints.
Built-in neural networks include a pruning mechanism that can be parameterized by the user. The pruning type is based on L1-norm. To comply with FHE constraints, Concrete-ML uses unstructured pruning, as the aim is not to eliminate neurons or convolutional filters completely, but to decrease their accumulator bit-width.
Custom neural networks, to work well under FHE constraints, should include pruning. When implemented with PyTorch, you can use its built-in pruning utilities (e.g. L1-Unstructured) to good effect.
For every neuron shown in each layer of the figure above, the linear combinations of inputs and learned weights are computed. Depending on the values of the inputs and weights, the sum - which for Concrete-ML neural networks is computed with integers - can take a range of different values.
To respect the bit-width constraint of the FHE circuit, the values of the accumulator must remain small to be representable using a maximum of 16 bits. In other words, the values must be between 0 and $2^{16} - 1$.
Pruning a neural network entails fixing some of the weights to be zero during training. This is advantageous to meet FHE constraints, as irrespective of the distribution of the inputs, multiplying these input values by 0 does not increase the accumulator value.
While pruning weights can reduce the prediction performance of the neural network, studies show that a high level of pruning (above 50%) can often be applied. See how Concrete-ML uses pruning in its built-in neural networks.
Here, $n_{\mathsf{max}}$ is the maximum precision allowed.
For example, if $n_{\mathsf{weights}} = 2$ and $n_{\mathsf{inputs}} = 2$ with $n_{\mathsf{max}} = 16$, the worst case is where all inputs and weights are equal to their maximal value $2^2 - 1 = 3$. In this case, there can be at most $\Omega = \mathsf{floor}(65535 / 9) = 7281$ elements in the multi-sums.
In practice, the distribution of the weights of a neural network is Gaussian, with many weights either 0 or having a small value. This enables exceeding the worst-case number of active neurons without having to risk overflowing the bit-width. In built-in neural networks, the parameter `n_hidden_neurons_multiplier` is multiplied with $\Omega$ to determine the total number of non-zero weights that should be kept in a neuron.
The neural network models are implemented with skorch, which provides a scikit-learn-like interface to Torch models.
While `NeuralNetClassifier` and `NeuralNetRegressor` provide scikit-learn-like models, their architecture is somewhat restricted in order to make training easy and robust. If you need more advanced models, you can convert custom neural networks as described in the sections on Torch and ONNX support.
Good quantization parameter values are critical to make models respect FHE constraints. Weights and activations should be quantized to low precision (e.g. 2-4 bits). Furthermore, the sparsity of the network can be tuned to avoid accumulator overflow.
The examples below show the behavior of built-in neural networks on several synthetic data-sets.
`module__activation_function`: can be one of the Torch activations (e.g. `nn.ReLU`)
`n_accum_bits` (default 8): maximum accumulator bit-width that is desired. The implementation will attempt to keep accumulators under this bit-width through pruning, i.e. setting some weights to zero
Other parameters from skorch can be found in the skorch documentation.
`module__n_hidden_neurons_multiplier`: The number of hidden neurons will be automatically set proportional to the dimensionality of the input (i.e. the value for `module__input_dim`). This parameter controls the proportionality factor and is set to 4 by default. This value gives good accuracy while avoiding accumulator overflow. See the pruning and quantization sections for more information.
These examples illustrate the basic usage of Concrete-ML to build various types of neural networks. They use simple data-sets, focusing on the syntax and usage of Concrete-ML. For examples showing how to train high-accuracy models on more complex data-sets, see the Demos and Tutorials section.
The examples listed here make use of simulation (using the Virtual Library) to perform evaluation over large test sets. Since FHE execution can be slow, only a few FHE executions can be performed. The correctness guarantees of Concrete-ML ensure that the accuracy measured with simulation is the same as the accuracy that will be obtained during FHE execution.
Following the , this notebook implements a Quantization Aware Training convolutional neural network on the MNIST data-set. It uses 3-bit weights and activations, giving a 7-bit accumulator.
Concrete-ML implements machine learning model inference using Concrete-Numpy as a backend. In order to execute in FHE, a numerical program written in Concrete-Numpy needs to be compiled. This functionality is provided by Concrete-Numpy, and Concrete-ML hides away most of the complexity of this step, completing the entire compilation process itself.
Additionally, the client/server API packages the result of the last step in a way that allows the deployment of the encrypted circuit to a server, as well as key generation, encryption, and decryption on the client side.
The first step in the list above takes a Python function implemented using Concrete-Numpy and transforms it into an executable operation graph.
execution of the op-graph, which includes TLUs, on clear non-encrypted data. This is, of course, not secure, but it is much faster than executing in FHE. This mode is useful for debugging, i.e. to find the appropriate hyper-parameters. This mode is called the Virtual Library.
While Concrete-ML hides away all the Concrete-Numpy code that performs model inference, it can be useful to understand how Concrete-Numpy code works. Here is a toy example for a simple linear regression model on integers. Note that this is just an example to illustrate compilation concepts. Generally, it is recommended to use the built-in models, which provide linear regression out of the box.
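A hedged sketch of such a toy program, assuming the decorator-based Concrete-Numpy compilation API (the decorator name and compilation flow may differ between Concrete-Numpy versions):

```python
import concrete.numpy as cnp

# Fixed integer "model" parameters for a toy linear regression: y = w * x + b
w, b = 3, 1

@cnp.compiler({"x": "encrypted"})
def predict(x):
    return w * x + b

# The inputset is used to determine the bit-widths of all intermediate values
inputset = range(16)
circuit = predict.compile(inputset)

# Encrypt, run in FHE, and decrypt
assert circuit.encrypt_run_decrypt(4) == 3 * 4 + 1
```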
The Virtual Lib in Concrete-ML is a prototype that provides drop-in replacements for Concrete-Numpy's compiler, allowing users to simulate FHE execution, including any probabilistic behavior FHE may induce. The Virtual Library feature comes from Concrete-Numpy.
The Virtual Lib, being pure Python and not requiring crypto key generation, can be much faster than the actual compilation and FHE execution. This allows for faster iterations, debugging, and FHE simulation, regardless of the bit-width used. For example, this was used for the red/blue contours in the , as computing in FHE for the whole grid and all the classifiers would take significant time.
The following example considers a `LogisticRegression` model on a simple classification problem. A more advanced example, which considers an `XGBClassifier`, is also available.
Concrete-ML provides several of the most popular linear models for regression and classification that can be found in scikit-learn:
Using these models in FHE is extremely similar to what can be done with scikit-learn's API, making it easy for data scientists who are used to this framework to get started with Concrete-ML.
Here is an example below of how to use a LogisticRegression model in FHE on a simple data set for classification. A more complete example can be found in the .
We can then plot the decision boundary of the classifier and compare those results with a scikit-learn model executed in clear. The complete code can be found in the .
Regarding FHE-friendly neural networks, QAT is the best way to reach optimal accuracy under FHE constraints. This technique allows weights and activations to be reduced to very low bit-widths (e.g. 2-3 bits), which, combined with pruning, can keep accumulator bit-widths low.
Concrete-ML uses the third-party library Brevitas to perform QAT for PyTorch NNs, but options exist for other frameworks such as Keras/Tensorflow.
Several demos and tutorials that use Brevitas are available in the Concrete-ML library.
This guide is based on a notebook tutorial, from which some code blocks are documented here.
For a more formal description of the usage of Brevitas to build FHE-compatible neural networks, please see the .
The example shows how to train a fully-connected neural network, similar to the one above, on a synthetic 2D data-set with a checkerboard grid pattern of 100 x 100 points. The data is split into 9500 training and 500 test samples.
Once trained, this PyTorch network can be imported using the `compile_torch_model` function. This function uses simple Post-Training Quantization.
The network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bit weights and activations. The mean accumulator size shown below was extracted using the Virtual Library and is measured as the mean over 10 runs of the experiment. An accumulator size of 6.6 means that 4 times out of 10 the accumulator measured was 6 bits, while 6 times it was 7 bits.
Using Quantization Aware Training is the best way to guarantee good accuracy for Concrete-ML compatible neural networks.
To understand how to overcome this limitation, consider a scenario where 2 bits are used for weights and layer inputs/outputs. The `Linear` layer computes a dot product between weights and inputs $y = \sum_i w_i x_i$. With 2 bits, no overflow can occur during the computation of the `Linear` layer as long as the number of neurons does not exceed 14, i.e. the sum of 14 products of 2-bit numbers does not exceed 7 bits.
By default, Concrete-ML uses symmetric quantization for model weights, with values in the interval $\left[-2^{n_{bits}-1}, 2^{n_{bits}-1}-1\right]$. For example, for $n_{bits}=2$ the possible values are $[-2, -1, 0, 1]$; for $n_{bits}=3$, the values can be $[-4, -3, -2, -1, 0, 1, 2, 3]$.
However, in a typical setting, the weights will not all have the maximum or minimum values (e.g. $-2^{n_{bits}-1}$). Instead, weights typically have a normal distribution around 0, which is one of the motivating factors for their symmetric quantization. A symmetric distribution and many zero-valued weights are desirable because opposite sign weights can cancel each other out and zero weights do not increase the accumulator size.
This fact can be leveraged to train a network with more neurons, while not overflowing the accumulator, using a technique called pruning, where the developer can impose a number of zero-valued weights. Torch provides support for pruning out of the box.
fit: ✓
compile: ✗
predict (execute_in_fhe=False): ✓
predict (execute_in_fhe=True): ✓