Concrete-ML provides simple neural network models with a Scikit-learn interface through the NeuralNetClassifier and NeuralNetRegressor classes. The neural network models are built with Skorch, which provides a scikit-learn-like interface to Torch models (more here). Currently, only linear layers are supported, but the number of layers, the activation function and the number of neurons in each layer are configurable. This approach is similar to what is available in Scikit-learn using the MLPClassifier/MLPRegressor classes. The built-in fully connected neural network (FCNN) models train easily with a single call to .fit(), which will automatically quantize the weights and activations.
While NeuralNetClassifier and NeuralNetRegressor provide scikit-learn-like models, their architecture is somewhat restricted in order to make training easy and robust. If you need more advanced models, you can convert custom neural networks, as described in the FHE-friendly models documentation.
To create an instance of a Fully Connected Neural Network, you need to instantiate one of the NeuralNetClassifier or NeuralNetRegressor classes and configure a number of parameters that are passed to their constructor. Note that some parameters need to be prefixed by module__, while others don't: the parameters related to the model, i.e. the underlying nn.Module, must have the prefix, while the parameters related to training options do not. The main parameters are:
- module__n_layers: number of layers in the FCNN; must be at least 1
- module__n_outputs: number of outputs (classes or targets)
- module__input_dim: dimensionality of the input
- module__activation_function: can be one of the Torch activations (e.g. nn.ReLU, see the full list here)
- n_w_bits (default 3): number of bits for weights
- n_a_bits (default 3): number of bits for activations and inputs
- n_accum_bits (default 8): maximum desired accumulator bit width. The implementation will attempt to keep accumulators under this bitwidth through pruning, i.e. by setting some weights to zero
- max_epochs: number of epochs to train the network (default 10)
- verbose: whether to log loss/metrics during training (default: False)
- lr: learning rate (default 0.001)

Other parameters from Skorch can be found in the Skorch documentation.

- module__n_hidden_neurons_multiplier: the number of hidden neurons is automatically set proportional to the dimensionality of the input (i.e. the value of module__input_dim). This parameter controls the proportionality factor and is set to 4 by default. This value gives good accuracy while avoiding accumulator overflow.
When you have training data in the form of a Numpy array, and targets in a Numpy 1d array, you can set:
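For instance, a minimal sketch (the import path and default values may differ between Concrete-ML versions):

```python
import numpy as np
from torch import nn
from concrete.ml.sklearn import NeuralNetClassifier  # import path may vary between versions

params = {
    "module__n_layers": 2,
    "module__n_outputs": 2,
    "module__input_dim": 10,
    "module__activation_function": nn.ReLU,
    "n_w_bits": 3,
    "n_a_bits": 3,
    "n_accum_bits": 8,
    "max_epochs": 10,
}

# Skorch expects float32 features and int64 targets
X = np.random.rand(100, 10).astype(np.float32)
y = np.random.randint(0, 2, size=100).astype(np.int64)

model = NeuralNetClassifier(**params)
model.fit(X, y)
```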
You can give weights to each class, to use in training. Note that this must be supported by the underlying torch loss function.
The n_hidden_neurons_multiplier parameter influences training accuracy, as it controls the number of non-zero neurons allowed in each layer. Increasing n_hidden_neurons_multiplier improves accuracy, but precision limitations should be taken into account to avoid overflow in the accumulator. The default value is a good compromise that avoids overflow in most cases, but you may want to lower it to reduce the breadth of the network if you encounter overflow errors. A value of 1 should be completely safe with respect to overflow.
Installing Concrete-ML using PyPI requires a Linux-based OS or macOS running on an x86 CPU. For Apple Silicon, Docker is the only currently supported option (see below).
Installing on Windows can be done using Docker or WSL. On WSL, Concrete-ML will work as long as the package is not installed in the /mnt/c/ directory, which corresponds to the host OS filesystem.
To install Concrete-ML from PyPI, run the following:
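For example:

```shell
pip install concrete-ml
```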
Concrete-ML can be installed using Docker by either pulling the latest image or a specific version:
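For example, assuming the image is published as zamafhe/concrete-ml on Docker Hub (image name and tags are assumptions):

```shell
# latest version
docker pull zamafhe/concrete-ml:latest

# or a specific version, e.g. v0.2.0 (tag assumed)
docker pull zamafhe/concrete-ml:v0.2.0
```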
The image can be used with Docker volumes, see the Docker documentation here.
The image can then be used via the following command:
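A typical invocation could look like this (image name and Jupyter port are assumptions):

```shell
docker run --rm -it -p 8888:8888 -v /host/path:/data zamafhe/concrete-ml
```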
This will launch a Concrete-ML enabled Jupyter server in Docker that can be accessed directly from a browser.
Alternatively, a shell can be launched in Docker, with or without volumes:
Concrete-ML is built on top of Concrete-Numpy, which enables Numpy programs to be converted into FHE circuits.
The lifecycle of a Concrete ML program is as follows:
Training. A model is trained using plaintext inputs.
Quantization. The trained model is then converted into an integer equivalent using quantization, which can happen either during training (Quantization-Aware Training) or after training (Post-Training Quantization).
Compilation. Once the model is quantized, it is compiled using Concrete's FHE compiler to produce an equivalent FHE circuit. This circuit is represented as an MLIR program consisting of low level cryptographic operations. You can read more about FHE compilation here, MLIR here and about the low-level Concrete library here.
Inference. The compiled model can then be deployed to a server and used to run private inference on encrypted inputs. You can see some examples here.
Here is an example for a simple linear regression model:
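A minimal sketch of such a flow, assuming the built-in LinearRegression model and its compile/predict methods (argument names are assumptions to verify against the API reference):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Training: done in the clear, quantization happens under the hood
model = LinearRegression(n_bits=8)
model.fit(X_train, y_train)

# Compilation: produce an FHE circuit from a representative inputset
model.compile(X_train)

# Inference: run encrypted prediction on a few test samples
y_pred_fhe = model.predict(X_test[:3], execute_in_fhe=True)
```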
At this stage, we have everything we need to deploy the model using Client
and Server
from concrete.numpy
. Please refer to the Concrete-Numpy implementation for more information on the deployment.
The current version of Concrete only supports up to 8-bit integers. This means that any floating point or large precision integer model will need to be converted to an 8-bit equivalent to be able to work with FHE. In most cases, this will require both quantization and pruning.
If you try to compile a program using more than 8 bits, the compiler will throw an error, as shown in this example:
Compiler output:
Notice that the maximum bit width, determined by the compiler, depends on the inputset passed to the compile_on_inputset
function. In this case, the error is caused by the input value in the inputset that produces a result whose representation requires 9 bits. This input is the value 8, since 8 * 42 = 336, which is a 9-bit value.
You can determine the number of bits necessary to represent an integer value with the formula:
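For an unsigned integer $x \geq 1$:

$$\mathsf{n\_bits}(x) = \lfloor \log_2(x) \rfloor + 1$$

For instance, $\lfloor \log_2(336) \rfloor + 1 = 9$, which matches the example above.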
For a more practical example, the MNIST classification task consists of taking an image, a 28x28 array containing uint8 values representing a handwritten digit, and predicting whether it belongs to one of 10 classes: the digits from 0 to 9. The output is a one-hot vector which indicates the class to which a particular sample belongs.
The input contains 28x28x8 bits, so 6272 bits of information. In practice, you could still obtain good results on MNIST by thresholding the pixels to {0, 1} and training a model for this new binarized MNIST task. This means that in a real use case where you actually need to perform digit recognition, you could binarize your input on the fly, replacing each pixel with either 0 or 1. In doing so, you use 1 bit per pixel and now only have 784 bits of input data. It also means that if you are doing some accumulation (adding pixel values together), you are going to need accumulators that are smaller (adding 0s and 1s requires less space than adding values ranging from 0 to 255). An example of MNIST classification with a quantized neural network is given in the CNN advanced example.
This shows how adapting your data or model parameters can allow you to use models that may require smaller data types (i.e. use less precision) to perform their computations.
Binarization is an extreme case of quantization which is introduced here. You can also find further resources on the linked page.
While applying quantization directly to the input features is mandatory to reduce the effective bit width of computations, a different and complementary approach is dimensionality reduction. This can be accomplished through Principal Component Analysis (PCA), as shown in the Poisson Regression example.
Quantization and dimensionality reduction reduce the bit width required to run the model and increase execution speed. These two tools are necessary to make models compatible with FHE constraints.
However, quantization and, especially, binarization, induce a loss in the accuracy of the model since its representation power is diminished. Carefully choosing a quantization approach for model parameters can alleviate accuracy loss, all the while allowing compilation to FHE.
The quantization of model parameters and model inputs is illustrated in the advanced examples for Linear and Logistic Regressions. Note that different quantization parameters are used for inputs and for model weights.
Recent quantization literature usually aims to make use of dedicated machine learning accelerators in a mixed setting where a CPU or General Purpose GPU (GPGPU) is also available. Thus, in literature, some floating point computation is assumed to be acceptable. This approach allows us to reach performance similar to those achieved by floating point models. In this popular mixed float-int setting, the input is usually left in floating point. This is also true for the first and last layers, which have more impact on the resulting model accuracy than hidden layers.
However, in Concrete-ML, to respect FHE constraints, the inputs, the weights and the accumulator must all be represented with integers of a maximum of 8 bits.
Thus, in Concrete-ML, we also quantize the input data and network output activations in the same way as the rest of the network: everything is quantized to a specific number of bits. It turns out that the number of bits used for the input or the output of any activation function is crucial to comply with the constraint on accumulator width.
The core operations in neural networks are matrix multiplications (matmul) and convolutions, which both compute linear combinations of inputs (encrypted) and weights (in clear). The linear combination operation must be done such that the maximum value of its result requires at most 8 bits of precision.
Currently, Concrete-ML computes the number of bits needed for the computation depending on the inputset calibration data and does not allow the overflow to happen, raising an exception as shown previously.
The following table summarizes the various examples in this section, along with their accuracies.
Model | Dataset | Metric | Clear | Quantized | FHE |
---|---|---|---|---|---|
A * means that FHE accuracy was calculated on a subset of the validation set.
⭐️ Star the repo on Github | 🗣 Community support forum | 📁 Contribute to the project
Concrete-ML is an open-source private machine learning inference framework based on fully homomorphic encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from Scikit-learn and PyTorch (see how it looks for linear models, tree-based models and neural networks).
Fully Homomorphic Encryption (FHE) is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in this introduction, or by joining the FHE.org community.
Here is a simple example of encrypted inference using logistic regression. More examples can be found here.
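A condensed sketch of that flow (the dataset, parameters and the execute_in_fhe argument are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                              # train on plaintext data
model.compile(X_train)                                   # quantize + compile to FHE
y_pred = model.predict(X_test[:5], execute_in_fhe=True)  # encrypted inference
```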
This example shows the typical flow of a Concrete-ML model:
The model is trained on unencrypted (plaintext) data
The resulting model is quantized to small integers using either post-training quantization or quantization-aware training
The quantized model is compiled to an FHE equivalent (under the hood, the model is first converted to a Concrete-Numpy program, then compiled)
Inference can then be done on encrypted data
To make a model work with FHE, the only constraint is to make it run within the supported precision limitations of Concrete-ML (currently 8-bit integers).
Concrete-ML is built on top of Zama’s Concrete framework. It uses Concrete-Numpy, which itself uses the Concrete-Compiler and the Concrete-Library. To use these libraries directly, refer to the Concrete-Numpy and Concrete-Framework documentations.
Currently, Concrete only supports 8-bit encrypted integer arithmetic. This requires models to be quantized heavily, which sometimes leads to a loss of accuracy vs the plaintext model. Furthermore, the Concrete-Compiler is still a work in progress, meaning it won't always find optimal performance parameters, leading to slower than expected execution times.
Additionally, Concrete-ML currently only supports FHE inference. Training on the other hand has to be done on unencrypted data, producing a model which is then converted to an FHE equivalent that can do encrypted inference.
Finally, there is currently no support for pre- and post-processing in FHE. Data must arrive at the FHE model already pre-processed, and post-processing (if there is any) has to be done client-side.
All of these issues are currently being addressed and significant improvements are expected to be released in the coming months.
Concrete-ML provides several of the most popular linear models for regression
or classification
that can be found in Scikit-learn:
Concrete-ML | scikit-learn |
---|---|
Using these models in FHE is extremely similar to what can be done with scikit-learn's API, making it easy for data scientists who are used to this framework to get started with Concrete-ML.
Models are also compatible with some of scikit-learn's main workflows, such as Pipeline() or GridSearch().
Here's an example of how to use this model in FHE on a simple dataset below. A more complete example can be found in the LogisticRegression notebook.
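A short sketch of this comparison, assuming the same compile/predict API as above (parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=2)

# Reference model, trained and executed in the clear
sklearn_model = SklearnLogisticRegression().fit(X, y)

# Concrete-ML model: trained in the clear, then quantized and compiled to FHE
concrete_model = ConcreteLogisticRegression(n_bits=2)
concrete_model.fit(X, y)
concrete_model.compile(X)

print("scikit-learn accuracy:", sklearn_model.score(X, y))

# FHE inference is slow, so only a few samples are evaluated here
y_pred_fhe = concrete_model.predict(X[:10], execute_in_fhe=True)
print("FHE predictions:", y_pred_fhe)
```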
We can then plot how the model classifies the inputs and then compare those results with a scikit-learn model executed in clear. The complete code can be found in the LogisticRegression notebook.
We can clearly observe the impact of quantization on the decision boundaries in the FHE model: the initially straight boundaries become staircase-like lines. However, this does not change the overall score, as both models achieve the same accuracy (90%).
In fact, the quantization process may sometimes create some artifacts that could lead to a decrease in performance. Still, the impact of those artifacts is often minor when considering linear models, making FHE models reach similar scores as their equivalent clear ones.
Concrete-ML provides several of the most popular tree models for classification that can be found in Scikit-learn:
Concrete-ML | scikit-learn |
---|---|
In addition to our support for scikit-learn, Concrete-ML also supports XGBoost's XGBClassifier:
Concrete-ML | XGBoost |
---|---|
Here's an example of how to use this model in FHE on a popular dataset using some of scikit-learn's preprocessing tools. A more complete example can be found in the XGBClassifier notebook.
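A sketch of how this might look with a standard scikit-learn preprocessing step (dataset and parameter values are illustrative; the n_bits argument is an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from concrete.ml.sklearn import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Standard scikit-learn preprocessing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3-bit quantized gradient boosting model
model = XGBClassifier(n_bits=3, n_estimators=20, max_depth=3)
model.fit(X_train, y_train)
model.compile(X_train)

# Encrypted inference on a few samples
y_pred_fhe = model.predict(X_test[:5], execute_in_fhe=True)
```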
Using the above example, we can then plot how the model classifies the inputs and compare those results with the XGBoost model executed in the clear. A 6-bit model is also given in order to better understand the impact of quantization on classification. Similar plots can be found in the Classifier Comparison notebook.
This shows the impact of quantization on the decision boundaries in the FHE models, especially with the 3-bit model, where only three main decision boundaries can be observed. This results in a small decrease in accuracy of about 7% compared to the initial XGBoost classifier. Using 6 bits of quantization makes the model reach 93% accuracy, reducing this difference to only 1.7%.
In fact, the quantization process may sometimes create some artifacts that could lead to a decrease in performance. Still, the impact of those artifacts is often minor when considering small tree-based models, making FHE models reach similar scores as their equivalent clear ones.
This section provides a set of tools and guidelines to help users build optimized FHE compatible models.
The Virtual Lib in Concrete-ML is a prototype that provides drop-in replacements for Concrete-Numpy's compiler that allow users to simulate what would happen when converting a model to FHE without the current bit width constraint, as well as quickly simulating the behavior with 8 bits or less without actually doing the FHE computations.
The Virtual Lib can be useful when developing and iterating on an ML model implementation. For example, you can check that your model is compatible in terms of operands (all integers) with the Virtual Lib compilation. Then, you can check how many bits your ML model would require, which can give you hints as to how it should be modified if you want to compile it to an actual FHE Circuit (not a simulated one) that only supports 8 bits of integer precision.
The Virtual Lib, being pure Python and not requiring crypto key generation, can be much faster than the actual compilation and FHE execution, thus allowing for faster iterations, debugging and FHE simulation, regardless of the bit width used. This was, for example, used for the red/blue contours in the Classifier Comparison notebook, as computing in FHE over the whole grid and all the classifiers would take significant time.
The following example shows how to use the Virtual Lib in Concrete-ML. Simply add use_virtual_lib = True
and enable_unsafe_features = True
in a Configuration
. The result of the compilation will then be a simulated circuit that allows for more precision or simulated FHE execution.
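A sketch of how this could be set up with a built-in model (the Configuration class location and the compile keyword names are assumptions that may differ across versions):

```python
import numpy as np
from concrete.numpy import Configuration  # class name/location assumed
from concrete.ml.sklearn import LogisticRegression

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

cfg = Configuration(enable_unsafe_features=True)  # required by the Virtual Lib

model = LogisticRegression()
model.fit(X_train, y_train)

# Simulated compilation: allows checking bit widths without real FHE execution
model.compile(X_train, configuration=cfg, use_virtual_lib=True)
```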
The following example produces a neural network that is not FHE compatible:
Upon execution, the compiler will raise the following error:
Knowing that a linear/dense layer is implemented as a matrix multiplication, it can be determined which parts of the op-graph listing in the exception message above correspond to which layers:
Layer weights initialization:
Input processing and quantization:
First dense layer and activation function:
Second dense layer and activation function:
Third dense layer and output quantization:
We can see here that the error is in the second layer. Reducing the number of neurons in this layer will resolve the error and make the network FHE compatible:
In FHE, univariate functions are encoded as table lookups, which are then implemented using Programmable Bootstrapping (PBS). Programmable bootstrapping is a powerful technique, but it requires significantly more computing resources, and thus time, than simpler encrypted operations such as matrix multiplication, convolution or addition.
Furthermore, the cost of a PBS will depend on the bitwidth of the compiled circuit. Every additional bit in the maximum bitwidth raises the complexity of the PBS by a significant factor. It thus may be of interest to the model developer to determine the bitwidth of the circuit and the number of PBS it performs.
This can be done by inspecting the MLIR code produced by the compiler:
We notice that we have calls to FHELinalg.apply_mapped_lookup_table
and FHELinalg.apply_lookup_table
. These calls apply PBS to the cells of their input tensors. Their inputs in the listing above are: tensor<1x2x!FHE.eint<8>>
for the first and last call and tensor<1x50x!FHE.eint<8>>
for the two calls in the middle. Thus PBS is applied 104 times.
Getting the bitwidth of the circuit is then simply:
Decreasing the number of bits and the number of PBS induces large reductions in the computation time of the compiled circuit.
In addition to the built-in Concrete-ML models and to custom torch models, it is also possible to directly compile ONNX models. This can be particularly appealing, notably to import models trained with Keras. It can also be interesting in the context of QAT, since many ONNX models are available on the web.
ONNX models can be compiled by performing post-training quantization (PTQ) or by directly importing models that are already quantized with quantization-aware training (QAT).
The following example shows how to compile an ONNX model using post-training quantization. The model was initially trained using Keras before being exported to ONNX. The training code is not shown here.
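A sketch of what this import could look like, assuming a compile_onnx_model helper in concrete.ml.torch.compile (helper name and signature are assumptions; check the API reference of your version):

```python
import numpy as np
import onnx
from concrete.ml.torch.compile import compile_onnx_model  # helper name assumed

# Load the ONNX model exported from Keras (path is illustrative)
onnx_model = onnx.load("model.onnx")

# Representative inputset used for calibration during post-training quantization
inputset = np.random.rand(100, 10).astype(np.float32)

quantized_numpy_module = compile_onnx_model(onnx_model, inputset, n_bits=3)
```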
While Keras was used in this example, it is not officially supported, as additional work is needed to test all of Keras' types of layer and models.
QAT models contain quantizers in the ONNX graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Since these QAT models have quantizers that are configured during training to a specific number of bits, the ONNX graph will need to be imported using the same settings:
The following operators are supported for evaluation and conversion to an equivalent FHE circuit. Other operators were not implemented either due to FHE constraints, or because they are rarely used in PyTorch activations or scikit-learn models.
Abs
Acos
Acosh
Add
Asin
Asinh
Atan
Atanh
AveragePool
BatchNormalization
Cast
Celu
Clip
Constant
Conv
Cos
Cosh
Div
Elu
Equal
Erf
Exp
Flatten
Gemm
Greater
GreaterOrEqual
HardSigmoid
HardSwish
Identity
LeakyRelu
Less
LessOrEqual
Log
MatMul
Mul
Not
Or
PRelu
Pad
Pow
ReduceSum
Relu
Reshape
Round
Selu
Sigmoid
Sin
Sinh
Softplus
Sub
Tan
Tanh
ThresholdedRelu
Transpose
Where
Pruning is a method to reduce neural network complexity, usually applied in order to reduce the computation cost or memory size. Pruning is used in Concrete-ML to control the size of accumulators in neural networks, thus making them FHE compatible. See the quantization documentation for an explanation of the accumulator bitwidth constraints.
In neural networks, a neuron computes a linear combination of inputs and learned weights, then applies an activation function.
The neuron computes:
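With inputs $x_i$, weights $w_i$, bias $b$ and activation function $\phi$, this can be written as:

$$y = \phi\left(\sum_i w_i x_i + b\right)$$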
When building a full neural network, each layer will contain multiple neurons, which are connected to the neuron outputs of a previous layer or to the inputs.
Fixing some of the weights to 0 makes the network graph look more similar to the following:
In addition to the built-in models, Concrete-ML supports generic machine learning models implemented with Torch, or exported to ONNX.
The following example uses a simple torch model that implements a fully connected neural network with two hidden units. Due to its small size, making this model respect FHE constraints is relatively easy.
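For illustration, such a model could look like the following sketch (the exact architecture of the original example may differ):

```python
import torch
from torch import nn

class SimpleTorchModel(nn.Module):
    """A small fully connected network."""

    def __init__(self, input_dim=2, hidden=10, n_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, n_classes)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return self.fc3(x)

torch_model = SimpleTorchModel()
# ... standard torch training loop on plaintext data ...
```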
Once the model is trained, calling the compile_torch_model
from Concrete-ML will automatically perform post-training quantization and compilation to FHE. Here, we use a 3-bit quantization for both the weights and activations.
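Assuming the model above has been trained, the compilation step could look like this (the import path and keyword names are assumptions to be checked against the API reference):

```python
import numpy as np
from concrete.ml.torch.compile import compile_torch_model

# Representative (calibration) inputset with the same shape as the training data
inputset = np.random.rand(100, 2).astype(np.float32)

quantized_numpy_module = compile_torch_model(
    torch_model,   # the trained nn.Module from the previous snippet
    inputset,
    n_bits=3,      # 3-bit weights and activations
)
```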
The model can now be used to perform encrypted inference. Next, the test data is quantized:
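Continuing the sketch above:

```python
x_test = np.random.rand(1, 2).astype(np.float32)
q_x_test = quantized_numpy_module.quantize_input(x_test)
```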
and the encrypted inference run using either:
quantized_numpy_module.forward_and_dequant()
to compute predictions in the clear, on quantized data and then de-quantize the result. The return value of this function contains the dequantized (float) output of running the model in the clear. Calling the forward function on the clear data is useful when debugging. The results in FHE will be the same as those on clear quantized data.
quantized_numpy_module.forward_fhe.encrypt_run_decrypt()
to perform the FHE inference. In this case, de-quantization is done in a second stage using quantized_numpy_module.dequantize_output()
.
While the example above shows how to import a floating point model for post-training quantization, Concrete-ML also provides an option to import quantization aware trained (QAT) models.
Suppose that n_bits_qat
is the bitwidth of activations and weights during the QAT process. To import a torch QAT network you can use the following library function:
Concrete-ML supports a variety of torch operators that can be used to build fully connected or convolutional neural networks, with normalization and activation layers. Moreover, many element-wise operators are supported.
Note that the equivalent versions from torch.functional
are also supported.
Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).
This means that some accuracy in the representation is lost (e.g. a simple approach is to eliminate least-significant bits), but, in many cases in machine learning, it is possible to adapt the models to give meaningful results while using these smaller data types. This significantly reduces the number of bits necessary for intermediary results during the execution of these machine learning models.
Since FHE is currently limited to 8-bit integers, it is necessary to quantize models to make them compatible. As a general rule, the smaller the precision models use, the better the FHE performance.
Let $[\alpha, \beta]$ be the range of our value to quantize, where $\alpha$ is the minimum and $\beta$ is the maximum. To quantize a range with floating point values (in $\mathbb{R}$) to integer values (in $\mathbb{Z}$), we first need to choose the data type that is going to be used. Concrete, the framework used by Concrete-ML, is currently limited to 8-bit integers, so this will be the value used in this example. Knowing the number of bits that can be used for a value in the range $[\alpha, \beta]$, we can compute the scale $S$ of the quantization:

$$S = \frac{\beta - \alpha}{2^n - 1}$$

where $n$ is the number of bits ($n \leq 8$).

In practice, the scale is the gap between two consecutive representable values: every interval of length $S$ of the input range is represented by a single integer in $[0, 2^n - 1]$, which means there can be a substantial loss of precision when $S$ is large.
The other important parameter of this quantization schema is the zero point value. This essentially brings the 0 floating point value to a specific integer. If the quantization scheme is asymmetric (quantized values are not centered on 0), the resulting integer will be in $[0, 2^n - 1]$.
When using quantized values in a matrix multiplication or convolution, the equations for computing the result become more complex. The IntelLabs distiller quantization documentation provides a more detailed explanation of the mathematics used to quantize values and how to keep computations consistent.
Quantization implemented in Concrete-ML is done in two ways:
The quantization is done automatically during the model's FHE compilation process. This approach requires little work by the user, but may not be a one-size-fits-all solution for all types of models. The final quantized model is FHE friendly and ready to predict over encrypted data. This approach is done using Post-Training Quantization (PTQ).
In some cases (when doing extreme quantization), PTQ is not sufficient to achieve a decent final model accuracy. Concrete-ML offers the possibility for the user to perform quantization before compilation to FHE, for example using Quantization-Aware Training (QAT). This can be done by any means, including by using third-party frameworks. In this approach, the user is responsible for implementing a full-integer model respecting the FHE constraints.
Concrete-ML has support for quantized ML models as well as quantization tools. The core of this functionality is the conversion of floating point values to integers. This is done using QuantizedArray
in concrete.ml.quantization
.
QuantizedArray takes the following arguments:

- n_bits: defines the precision of the quantization
- values: the floating point values that will be converted to integers
- is_signed: determines if the quantized integer values should allow negative values
- is_symmetric: determines if the range of floating point values to be quantized should be taken as symmetric around zero
It is also possible to use symmetric quantization, where the integer values are centered around 0:
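A small sketch of both constructions, using the arguments listed above (the qvalues and dequant() members are the ones described later in this section):

```python
import numpy as np
from concrete.ml.quantization import QuantizedArray

values = np.array([-1.57, 0.0, 0.42, 1.8])

# Asymmetric 2-bit quantization over the range of `values`
q_arr = QuantizedArray(n_bits=2, values=values)
print(q_arr.qvalues)    # integer representation
print(q_arr.dequant())  # approximate floating point values

# Symmetric, signed 2-bit quantization centered around 0
q_sym = QuantizedArray(n_bits=2, values=values, is_signed=True, is_symmetric=True)
```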
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
The models implemented in Concrete-ML provide features to let the user quantize the input data and dequantize the output data.
Here is a simple example showing how to perform inference, starting from float values and ending up with float values. Note that the FHE engine that is compiled for the ML models does not support data batching.
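A sketch of this flow, using the function names described just below (exact signatures may vary):

```python
import numpy as np

# quantized_module: a compiled Concrete-ML model / QuantizedModule
# x_test: a 2D numpy array of floating point test samples
predictions = []
for x in x_test:
    q_x = quantized_module.quantize_input(x.reshape(1, -1))
    q_y = quantized_module.forward_fhe.encrypt_run_decrypt(q_x)
    predictions.append(quantized_module.dequantize_output(q_y))

predictions = np.concatenate(predictions)
```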
The functions quantize_input
and dequantize_output
make use of QuantizedArray
described above. When the ML model quantized_module
is calibrated, the min and max of the value distributions will be stored and applied to quantize/dequantize new data.
In the following example, QuantizedArray
is used in a different way, using pre-quantized integer values and having the scale
and zero-point
set explicitly from calibration parameters. Once the QuantizedArray
is constructed, calling dequant()
will compute the floating point values corresponding to the integer values qvalues
, which are the output of the forward_fhe.encrypt_run_decrypt(..)
call.
There are 3 types of operators:
Operators that perform linear combinations of encrypted and constant (clear) values. For example: matrix multiplication, convolution, addition
Operators that perform element-wise operations between two encrypted tensors. For example: addition
Element-wise, fixed-function operators which can be: addition with a constant, activation functions
The following example shows how to use the _prepare_inputs_with_constants
helper function with quantize_actual_values=True
to apply the quantization function to the input data of the Gemm
operator. Since the quantization function uses floats and a non-linear function (round
), a TLU will automatically be generated together with quantization.
This section includes a complete example of a neural network in Torch, as well as links to additional examples.
In this example, we will train a fully-connected neural network on a synthetic 2D dataset with a checkerboard grid pattern of 100 x 100 points. The data is split into 9500 training and 500 test samples.
This shows that the fp32 accuracy and accumulator size increase with the number of hidden neurons, while the 3-bit accuracy remains low irrespective of the number of neurons. While all the configurations tried here were FHE compatible (accumulator < 8 bits), it is sometimes preferable to have a lower accumulator size in order for the inference time to be faster.
The accumulator size is determined by Concrete-Numpy as the maximum bitwidth encountered anywhere in the encrypted circuit.
Considering that FHE only works with limited integer precision, there is a risk of overflowing in the accumulator, resulting in unpredictable results.
The following code shows how to use pruning in our previous example:
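A sketch of how this could be done with torch's pruning utilities (not necessarily the exact code of the original example; layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in for the network of the previous example
net = nn.Sequential(nn.Linear(2, 100), nn.ReLU(), nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 2))
hidden_layers = [net[0], net[2]]

def prune_layers(layers, amount=0.7):
    """Force a fraction of the smallest-magnitude weights of each layer to zero."""
    for layer in layers:
        prune.l1_unstructured(layer, name="weight", amount=amount)

def make_pruning_permanent(layers):
    """Remove the pruning re-parametrization so the zeros are baked into the weights."""
    for layer in layers:
        prune.remove(layer, name="weight")

prune_layers(hidden_layers, amount=0.7)
# ... train the pruned network ...
make_pruning_permanent(hidden_layers)
```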
Results with PrunedSimpleNet, a pruned version of SimpleNet with 100 neurons in the hidden layers, are given below:
This shows that the fp32 accuracy has been improved while maintaining constant mean accumulator size.
When pruning a larger neural network during training, it is easier to obtain a low-bitwidth accumulator while maintaining better final accuracy. Thus, pruning is more robust than training a similar smaller network.
The quantization-aware training (QAT) import tool in Concrete-ML is a work in progress. While it has been tested with some networks built with Brevitas, it is possible to use other tools to obtain QAT networks.
Training this network with 30 non-zero neurons out of 100 total gives good accuracy while being FHE compatible (accumulator size < 8 bits).
The torch QAT training loop is the same as the standard floating point training loop, but hyperparameters such as learning rate might need to be adjusted.
Quantization Aware Training is somewhat slower than normal training. QAT introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs, as its computational intensity is low with respect to data transfer time.
The following table summarizes the examples in this section.
In this table, ** means that the accuracy is actually random-like, because the quantization needed to fulfill the bitwidth constraints is too strong.
While this can seem like a major limitation, in practice machine learning models have features that use only a limited range of values. For example, if a feature takes a value that is limited to the range $[1, 2)$, in floating point this value is represented as $2^0 \cdot m$ where $m$ is the mantissa, a number between 1 and 2. Generic floating point representation can support exponents between -126 and 127, allocating 8 bits to store the exponent. In our case, a single exponent value of 0 is needed, so, for this range, the exponent can take only a single value out of the 253 possible ones. We can thus save the 8 bits allocated to the exponent, reducing the bit width necessary.
For example, if you quantize your inputs and weights with $n_{\mathsf{inputs}}$ and $n_{\mathsf{weights}}$ bits of precision, one can compute the maximum dimensionality of the input and weights before the matmul/convolution result could exceed the 8 bits as such:

$$\Omega = \mathsf{floor}\left(\frac{2^{n_{\mathsf{max}}} - 1}{(2^{n_{\mathsf{weights}}} - 1)(2^{n_{\mathsf{inputs}}} - 1)}\right)$$

where $n_{\mathsf{max}} = 8$ is the maximum precision allowed. For example, if we set $n_{\mathsf{weights}} = 2$ and $n_{\mathsf{inputs}} = 2$ with $n_{\mathsf{max}} = 8$, then at most $\Omega = 28$ inputs/weights are allowed in the linear combination.

Beyond $\Omega$ dimensions in the input and weights, the risk of overflow increases quickly. It may happen that, for some distributions of weights and values, the computation does not overflow, but the risk increases rapidly with the number of dimensions.
For every neuron shown in each layer of the figure above, the linear combinations of inputs and learned weights are computed. Depending on the values of the inputs and weights, the sum - which for Concrete-ML neural networks is computed with integers - can take a range of different values.
To respect the bit width constraint of the FHE circuit, the values of the accumulator must remain small to be representable with only 8 bits. In other words, the values must be between 0 and 255.
Pruning a neural network entails fixing some of the weights to be zero during training. This is advantageous to meet FHE constraints, as, irrespective of the distribution of the input values, multiplying these values by 0 does not increase the accumulator value.
While pruning weights can reduce the prediction performance of the neural network, studies show that a high level of pruning (above 50%) can often be applied. See how Concrete-ML uses pruning in its built-in fully connected neural networks.
QAT models contain quantizers in the torch graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Torch quantizers are not included in Concrete-ML, so you can either implement your own or use a third-party library such as Brevitas, as shown in the QAT examples. Custom models can have a more generic architecture and training procedure than the Concrete-ML built-in models.
-- sometimes accuracy issues
-- sometimes accuracy issues
-- partial support
The QuantizedArray class takes several arguments that determine how float values are quantized (see the API reference for more information):
Intermediary values computed during model inference might need to be re-scaled into the quantized domain of a subsequent model operator. For example, the output of a convolution layer in a neural network might have values that are 8 bits wide, with the next convolutional layer requiring that its inputs are at most 2 bits wide. In the non-encrypted realm, this re-scaling requires floating point operations. To make this work with integers as required by FHE, Concrete-ML uses a table lookup (TLU), which is implemented in FHE with programmable bootstrapping. Table lookups are expensive in FHE, and so should only be used when necessary.
The operations done by the activation function of a previous layer and the additional re-scaling to the new quantized domain, which are all floating point operations, can be fused into a single table lookup. Concrete-ML implements quantized operators that perform this fusion, significantly reducing the number of TLUs necessary to perform inference.
TLU generation for element-wise operations can be delegated to Concrete-Numpy directly by calling the function's corresponding NumPy implementation.
IntelLabs distiller explanation of quantization:
Lei Mao's blog on quantization:
Google paper on neural network quantization and integer-only inference:
This network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bit weights and activations. The mean accumulator size shown below was extracted using the Virtual Lib.
neurons | 10 | 30 | 100 |
---|---|---|---|
To understand how to overcome this limitation, consider a scenario where 2 bits are used for weights and layer inputs/outputs. The Linear layer computes a dot product between weights $w$ and inputs $x$. With 2 bits, no overflow can occur during the computation of the Linear layer as long as the number of neurons does not exceed 14, i.e. the sum of 14 products of 2-bit numbers does not exceed 7 bits.
By default, Concrete-ML uses symmetric quantization for model weights, with values in the interval $[-2^{n_{\mathsf{bits}}-1}, 2^{n_{\mathsf{bits}}-1} - 1]$. For example, for $n_{\mathsf{bits}} = 2$ the possible values are $\{-2, -1, 0, 1\}$; for $n_{\mathsf{bits}} = 3$ the values can be $\{-4, -3, -2, -1, 0, 1, 2, 3\}$.
However, in a typical setting, the weights will not all have the maximum or minimum value (e.g. $-2^{n_{\mathsf{bits}}-1}$). Instead, weights typically have a normal distribution around 0, which is one of the motivating factors for their symmetric quantization. A symmetric distribution and many zero-valued weights are desirable because opposite-sign weights can cancel each other out and zero weights do not increase the accumulator size.
This can be leveraged to train a network with more neurons, while not overflowing the accumulator, using a technique called pruning, where the developer can impose a number of zero-valued weights. Torch supports pruning out of the box.
non-zero neurons | 10 | 30 |
---|---|---|
While pruning helps maintain the post-quantization level of accuracy in low-precision settings, it does not help maintain accuracy when quantizing from floating point models. The best way to guarantee accuracy is to use quantization-aware training (read more in the quantization documentation).
In this example, QAT is done using Brevitas, changing Linear layers to QuantLinear and adding quantizers on the inputs of linear layers using QuantIdentity.
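A sketch of what such a network could look like with Brevitas (layer sizes and bit widths are illustrative; check the Brevitas documentation for exact arguments):

```python
import torch.nn as nn
import brevitas.nn as qnn

N_BITS = 3

class QATSimpleNet(nn.Module):
    def __init__(self, n_hidden=30):
        super().__init__()
        # Quantize the inputs of each linear layer
        self.quant_input = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.fc1 = qnn.QuantLinear(2, n_hidden, bias=True, weight_bit_width=N_BITS)
        self.quant_hidden = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.fc2 = qnn.QuantLinear(n_hidden, 2, bias=True, weight_bit_width=N_BITS)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.quant_input(x)
        x = self.quant_hidden(self.relu(self.fc1(x)))
        return self.fc2(x)
```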
non-zero neurons | 30 |
---|---|
Model | Dataset | Metric | Clear | Quantized | FHE |
---|---|---|---|---|---|
Linear Regression | Synthetic 1D | R2 | 0.876 | 0.863 | 0.863 |
Logistic Regression | Synthetic 2D with 2 classes | accuracy | 0.90 | 0.875 | 0.875 |
Poisson Regression | | mean Poisson deviance | 0.61 | 0.60 | 0.60 |
Gamma Regression | | mean Gamma deviance | 0.45 | 0.45 | 0.45 |
Tweedie Regression | | mean Tweedie deviance (power=1.9) | 33.42 | 34.18 | 34.18 |
Decision Tree | | precision score | 0.95 | 0.97 | 0.97* |
XGBoost | | MCC | 0.48 | 0.52 | 0.52* |
fp32 accuracy | 68.70% | 83.32% | 88.06% |
3bit accuracy | 56.44% | 55.54% | 56.50% |
mean accumulator size | 6.6 | 6.9 | 7.4 |
fp32 accuracy | 82.50% | 88.06% |
3bit accuracy | 57.74% | 57.82% |
mean accumulator size | 6.6 | 6.8 |
3bit accuracy brevitas | 94.4% |
3bit accuracy in Concrete-ML | 91.8% |
accumulator size | 7 |
Concrete-ML provides functionality to deploy FHE machine learning models in a client/server setting. The deployment workflow and model serving follows the following pattern:
The training of the model and its compilation to FHE are performed on a development machine. Three different files are created when saving the model:
- client.json: contains the secure cryptographic parameters needed for the client to generate the private and evaluation keys
- server.json: contains the compiled model. This file is sufficient to run the model on a server.
- serialized_processing.json: contains the metadata about the pre- and post-processing, such as the quantization parameters used to quantize the input and dequantize the output
The compiled model (server.zip
) is deployed to a server and the cryptographic parameters (client.zip
) along with the model meta data (serialized_processing.json
) are shared with the clients.
The client obtains the cryptographic parameters (using client.zip
) and generates a private encryption/decryption key as well as a set of public evaluation keys. The public evaluation keys are then sent to the server, while the secret key remains on the client.
The private data is then encrypted using serialized_processing.json
by the client and sent to the server. Server-side, the FHE model inference is run on the encrypted inputs using the public evaluation keys.
The encrypted result is then returned by the server to the client, which decrypts it using its private key. Finally, the client performs any necessary post-processing of the decrypted result using serialized_processing.json
.
For a complete example, see this notebook.
Concrete-ML uses Skorch to implement multi-layer, fully-connected torch neural networks in a way that is compatible with the Scikit-learn API.
This wrapper implements Torch training boilerplate code, alleviating the work that needs to be done by the user. It is possible to add hooks during the training phase, for example once an epoch is finished.
Skorch allows the user to easily create a classifier or regressor around a neural network (NN), implemented in Torch as a nn.Module
, which is used by Concrete-ML to provide a fully-connected multi-layer NN with a configurable number of layers and optional pruning (see pruning and the neural network documentation for more information).
Under the hood, Concrete-ML uses a Skorch wrapper around a single torch module, SparseQuantNeuralNetImpl
. More information can be found in the API guide.
A linear or convolutional layer of an NN will compute a linear combination of weights and inputs (also called a 'multi-sum'). For example, a linear layer will compute:

$$y_k = \sum_{i} w_{ki} x_i + b_k$$

where $y_k$ is the output of the k-th neuron in the layer. In this case, the sum is taken on a single dimension. A convolutional layer will compute, for each output position:

$$y_k = \sum_{c=1}^{C} \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} w^{(k)}_{cij} \, x_{cij} + b_k$$

where $w^{(k)}$ is the k-th filter of the convolutional layer and $C$, $K_h$ and $K_w$ are the number of input channels, the kernel height and the kernel width, respectively.
Following the formulas for the resulting bit width of quantized linear combinations described here, it can be seen that the maximum dimensionality of the input and weights that can make the result exceed 8 bits is:

$$\Omega = \mathsf{floor}\left(\frac{2^{n_{\mathsf{max}}} - 1}{(2^{n_{\mathsf{weights}}} - 1)(2^{n_{\mathsf{inputs}}} - 1)}\right)$$

Here, $n_{\mathsf{max}} = 8$ is the maximum precision allowed. For example, if $n_{\mathsf{weights}} = 2$ and $n_{\mathsf{inputs}} = 2$ with $n_{\mathsf{max}} = 8$, the worst case is where all inputs and weights are equal to their maximal value $2^2 - 1 = 3$. In this case, there can be at most $\Omega = 28$ elements in the multi-sums.
In practice, the distribution of the weights of a neural network is Gaussian, with many weights either zero or having a small value. This enables exceeding the worst-case number of active neurons without risking overflow of the bitwidth. The parameter n_hidden_neurons_multiplier
is multiplied with this maximum dimensionality $\Omega$ to determine the total number of non-zero weights that should be kept in a neuron.
The pruning mechanism is already implemented in SparseQuantNeuralNetImpl
, and the user only needs to determine the parameters listed above. They can be chosen in a way that is convenient, e.g. maximizing accuracy.
Increasing n_hidden_neurons_multiplier
can lead to improved performance, as long as the compiled NN does not exceed 8 bits of accumulator bitwidth.
Hummingbird is a third party open-source library that converts machine learning models into tensor computations. Many algorithms (see supported algorithms) are converted using a specific backend (torch, torchscript, ONNX and TVM).
Concrete-ML allows the conversion of an ONNX inference to NumPy inference (note that NumPy is always the entry point to run models in FHE with Concrete ML).
Hummingbird exposes a convert
function that can be imported as follows from the hummingbird.ml
package:
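That is:

```python
from hummingbird.ml import convert
```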
This function can be used to convert a machine learning model to an ONNX model as follows:
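For instance, converting a trained scikit-learn model (the container's model attribute is an assumption that may vary across Hummingbird versions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from hummingbird.ml import convert

X_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=100)

sklearn_model = LogisticRegression().fit(X_train, y_train)

# Convert to ONNX; `test_input` is used to infer the input signature
onnx_model = convert(sklearn_model, backend="onnx", test_input=X_train).model
```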
In theory, the resulting onnx_model could be used directly within Concrete-ML's get_equivalent_numpy_forward method (as long as all operators present in the ONNX model are implemented in NumPy) to obtain the NumPy inference.
In practice, there are some steps needed to clean the ONNX output and make the graph compatible with Concrete-ML, such as applying quantization where needed or deleting/replacing non-FHE friendly ONNX operators (such as Softmax and ArgMax).
Concrete-ML implements machine learning model inference using Concrete-Numpy as a backend. In order to execute in FHE, a numerical program written in Concrete-Numpy needs to be compiled. This functionality is described here, and Concrete-ML hides away most of the complexity of this step. The entire compilation process is done by Concrete-Numpy.
From the perspective of the Concrete-ML user, the compilation process performed by Concrete-Numpy can be broken up into 3 steps:
numpy program tracing and creation of a Concrete-Numpy op-graph
checking that the op-graph is FHE compatible
producing machine code for the op-graph. This step automatically determines cryptographic parameters
Additionally, the client/server API packages the result of the last step in a way that allows deploying the encrypted circuit to a server, as well as performing key generation, encryption and decryption on the client side.
The first step in the list above takes a Python function implemented using the Concrete-Numpy supported operation set and transforms it into an executable operation graph. In this step, all the floating point subgraphs in the op-graph are fused and converted to Table Lookup operations.
This makes it possible to:
execute the op-graph, which includes TLUs, on clear non-encrypted data. This is, of course, not secure, but is much faster than executing in FHE. This mode is useful for debugging. This is called the Virtual Library.
verify the maximum bitwidth of the op-graph, to determine FHE compatibility, without actually compiling the circuit to machine code. This feature is available through Concrete-Numpy and is part of the overall FHE Assistant.
The second step of compilation checks that the maximum bitwidth encountered anywhere in the circuit is valid.
If the check fails for a machine learning model, the user will need to tweak the available quantization, pruning and model hyperparameters in order to obtain FHE compatibility. The Virtual Library is useful in this setting, as described in the debugging models section.
Finally, the FHE compatible op-graph and the necessary cryptographic primitives from Concrete-Framework are converted to machine code.
Before you start this section, you must install Docker by following this official guide.
Once you have access to this repository and the dev environment is installed on your host OS (via make setup_env
once you followed the steps here), you should be able to launch the commands to build the dev Docker image with make docker_build
.
Once you do that, you can get inside the Docker environment using the following command:
After you finish your work, you can leave Docker by using the exit
command or by pressing CTRL + D
.
Documentation with GitBook is done mainly by pushing content on GitHub. GitBook then pulls the docs from the repository and publishes them. In most cases, GitBook is just a reflection of what is available in GitHub.
There are, however, some use cases where we want to modify the documentation directly in GitBook (and then push the modifications to GitHub), for example when the documentation is modified by a person outside of our organization. In this case, a GitHub branch is created and a GitBook space is associated with it: modifications are done in this space and automatically pushed to the branch. Once the modifications are done, one can simply create a pull request to finally merge the modifications into the main branch.
Documentation can alternatively be built using Sphinx:
The documentation contains both files written by hand by developers (the .md files) and files automatically created by parsing the source files.
Then, to open the documentation, go to docs/_build/html/index.html or use the following command:
To build and open the docs at the same time, use:
Concrete-ML is currently in beta, and thus may contain bugs or suboptimal APIs.
Before opening an issue or asking for support, we encourage users to read this documentation to understand common issues and limitations of Concrete-ML, as well as to check the outstanding issues on GitHub.
Furthermore, undefined behavior may occur if the inputset, which is internally used by the compilation core to set the bit widths of some intermediate data, is not sufficiently representative of the future user inputs. With all the inputs in the inputset, it appears that intermediate data can be represented as an n-bit integer, but, for a particular computation, this same intermediate data may need additional bits to be represented. The FHE execution for this computation will then result in an incorrect output, as typically occurs with integer overflows in classical programs.
If you didn't find an answer, you can ask a question on the Zama forum, or in the FHE.org discord.
When submitting an issue (here), make sure you can isolate and reproduce the bug, and give us as much information as possible. In addition to the Python script, the following information is useful:
the reproducibility rate you see on your side
any insight you might have on the bug
any workaround you have been able to find
If you would like to contribute to the project and send pull requests, take a look at the contributor guide.
There are two ways to contribute to Concrete-ML:
You can open issues to report bugs and typos and to suggest ideas.
You can ask to become an official contributor by emailing hello@zama.ai. Only approved contributors can send pull requests (PR), so please make sure to get in touch before you do!
Concrete-ML uses a consistent branch naming scheme, and you are expected to follow it as well. Here is the format, along with some examples:
e.g.
Each commit to Concrete-ML should conform to the standards of the project. You can let the development tools fix some issues automatically with the following command:
Conformance can be checked using the following command:
Your code must be well documented, containing tests and not breaking other tests:
You need to make sure you get 100% code coverage. The make pytest
command checks that by default and will fail with a coverage report at the end should some lines of your code not be executed during testing.
If your coverage is below 100%, you should write more tests and then create the pull request. If you ignore this warning and create the PR, GitHub actions will fail and your PR will not be merged.
There may be cases where covering your code is not possible (an exception that cannot be triggered in normal execution circumstances). In those cases, you may be allowed to disable coverage for some specific lines. This should be the exception rather than the rule, and reviewers will ask why some lines are not covered. If it appears they can be covered, then the PR won't be accepted in that state.
Concrete-ML uses a consistent commit naming scheme, and you are expected to follow it as well (the CI will make sure you do). The accepted format can be printed to your terminal by running:
e.g.
To learn more about conventional commits, check this page. Just a reminder that commit messages are checked in the conformance step and rejected if they don't follow the rules.
You should rebase on top of the main
branch before you create your pull request. Merge commits are not allowed, so rebasing on main
before pushing gives you the best chance of avoiding having to rewrite parts of your PR later if some conflicts arise with other PRs being merged. After you commit your changes to your new branch, you can use the following commands to rebase:
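A typical sequence (adapt the remote and branch names to your setup):

```shell
git fetch origin
git rebase origin/main
# resolve any conflicts, then update your remote branch
git push --force-with-lease
```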
You can learn more about rebasing here.
Before any final release, Concrete-ML contributors go through a release candidate (RC) cycle. The idea is that once the codebase and documentations look ready for a release, you create an RC release by opening an issue with the release template here, starting with version vX.Y.Zrc1
and then with versions vX.Y.Zrc2
, vX.Y.Zrc3
...
Once the last RC is deemed ready, open an issue with the release template using the last RC version from which you remove the rc?
part (i.e. v12.67.19
if your last RC version was v12.67.19-rc4
) on GitHub.
Concrete-ML is a Python
library, so Python
should be installed to develop Concrete-ML. v3.8
and v3.9
are the only supported versions. Concrete-ML also uses Poetry
and Make
.
First of all, you need to git clone
the project:
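For example, over https:

```shell
git clone https://github.com/zama-ai/concrete-ml.git
```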
Some tests require files tracked by git-lfs to be downloaded. To do so please follow the instructions on git-lfs website then run git lfs pull
.
A simple way to have everything installed is to use the development Docker (see the docker setup guide). On Linux and macOS you have to run the script in ./script/make_utils/setup_os_deps.sh
. Specify the --linux-install-python
flag if you want to install python3.8 as well on apt-enabled Linux distributions. The script should install everything you need for Docker and bare OS development (you can first check the content of the file to check what it will do).
For Windows users, the setup_os_deps.sh
script does not install dependencies, because of the many different installation methods available and the lack of a single package manager.
The first step is to install Python (as some of our dev tools depend on it), then Poetry. In addition to installing Python, you are still going to need the following software available on path on Windows, as some of our basic dev tools depend on them:
Development on Windows only works with the Docker environment. Follow this link to setup the Docker environment.
To manually install Python, you can follow this guide (alternatively, you can google how to install Python 3.8 (or 3.9)
).
Poetry
is used as the package manager. It drastically simplifies dependency and environment management. You can follow this official guide to install it.
As there is no concrete-compiler
package for Windows, only the dev dependencies can be installed. This requires Poetry >= 1.2.
At the time of writing (June 2022), there is only an alpha version of Poetry 1.2 that you can install. Use the official installer to install preview versions.
The dev tools use make
to launch the various commands.
On Linux, you can install make
from your distribution's preferred package manager.
On macOS, you can install a more recent version of make
via brew:
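For example:

```shell
brew install make
```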
It is possible to install gmake
as make
. Check this StackOverflow post for more info.
On Windows, check this GitHub gist.
In the following sections, be sure to use the proper make
tool for your system: make
, gmake
, or other.
To get the source code of Concrete-ML, clone the code repository using the link for your favourite communication protocol (ssh or https).
We are going to make use of virtual environments. This helps to keep the project isolated from other Python
projects in the system. The following commands will create a new virtual environment under the project directory and install dependencies to it.
The following command will not work on Windows if you don't have Poetry >= 1.2.
Finally, activate the newly created environment using the following command:
Docker automatically creates and sources a venv in ~/dev_venv/
The venv persists thanks to volumes. We also create a volume for ~/.cache to speed up later reinstallations. You can check which Docker volumes exist with:
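For example:

```shell
docker volume ls
```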
You can still run all make
commands inside Docker (to update the venv, for example). Be mindful of the current venv being used (the name in parentheses at the beginning of your command prompt).
After your work is done, you can simply run the following command to leave the environment:
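Assuming the environment was activated with source, this is simply:

```shell
deactivate
```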
From time to time, new dependencies will be added to the project or the old ones will be removed. The command below will make sure the project has the proper environment. So run it regularly!
If you are having issues, consider using the dev Docker exclusively (unless you are working on OS specific bug fixes or features).
Here are the steps you can take on your OS to try and fix issues:
At this point, you should consider using Docker as nobody will have the exact same setup as you. If, however, you need to develop on your OS directly, you can ask us for help but may not get a solution right away.
Here are the steps you can take in your Docker to try and fix issues:
If the problem persists at this point, you should ask for help. We're here and ready to assist!
Fully Connected NN | accuracy | 0.947 | 0.895 | 0.895 |
QAT Fully Connected NN | Synthetic (Checkerboard) | accuracy | 0.94 | 0.94 | 0.94 |
Convolutional NN | accuracy | 0.90 | ** | ** |
Internally, Concrete-ML uses ONNX operators as intermediate representation (or IR) for manipulating machine learning models produced through export for PyTorch, Hummingbird and skorch.
As ONNX is becoming the standard exchange format for neural networks, this allows Concrete-ML to be flexible while also making model representation manipulation quite easy. In addition, it allows for straightforward mapping to NumPy operators, supported by Concrete-Numpy, to use the Concrete stack's FHE conversion capabilities.
The diagram below gives an overview of the steps involved in the conversion of an ONNX graph to an FHE-compatible format, i.e. a format that can be compiled to FHE through Concrete-Numpy.
All Concrete-ML built-in models follow the same pattern for FHE conversion:
1. The models are trained with sklearn or torch.
2. All models have a torch implementation for inference. This implementation is provided either by a third-party tool such as Hummingbird, or implemented directly in Concrete-ML.
3. The torch model is exported to ONNX. For more information on the use of ONNX in Concrete-ML, see here.
4. The Concrete-ML ONNX parser checks that all the operations in the ONNX graph are supported and assigns reference NumPy operations to them. This step produces a NumpyModule.
5. Quantization is performed on the NumpyModule, producing a QuantizedModule. Two steps are performed: calibration and assignment of equivalent QuantizedOp objects to each ONNX operation. The QuantizedModule class is the quantized counterpart of the NumpyModule.
6. Once the QuantizedModule is built, Concrete-Numpy is used to trace the ._forward() function of the QuantizedModule.
Moreover, by passing a user-provided nn.Module to step 2 of the above process, Concrete-ML supports custom user models. See the associated FHE-friendly model documentation for instructions about working with such models.
Once an ONNX model is imported, it is converted to a NumpyModule, then to a QuantizedModule and, finally, to an FHE circuit. However, as the diagram shows, it is perfectly possible to stop at the NumpyModule level if you just want to run the torch model as NumPy code without doing quantization.
Note that if you keep the obtained NumpyModule without quantizing it with Post Training Quantization (PTQ), it will not be convertible to FHE, since the Concrete stack requires operators to use integers for computations.
The NumpyModule stores the ONNX model that it interprets. The interpreter works by going through the ONNX graph in topological order, and storing the intermediate results as it goes. To execute a node, the interpreter feeds the required inputs - taken either from the model inputs or the intermediate results - to the NumPy implementation of each ONNX node.
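As an illustration only, the interpretation loop can be sketched as follows; the function and mapping names are hypothetical and do not match the actual Concrete-ML internals (initializers and constants are also ignored for brevity):

```python
def interpret_onnx_graph(graph, numpy_impl_of, *model_inputs):
    """Hypothetical, simplified topological ONNX interpreter.

    `graph` is assumed to be an onnx GraphProto whose nodes are already in
    topological order, and `numpy_impl_of` maps an ONNX op type to a NumPy
    implementation returning a tuple of outputs.
    """
    # Intermediate results, keyed by tensor name
    results = {inp.name: value for inp, value in zip(graph.input, model_inputs)}

    for node in graph.node:
        # Gather this node's inputs from the model inputs or previous results
        inputs = [results[name] for name in node.input]
        # Execute the NumPy implementation of the ONNX operator
        outputs = numpy_impl_of[node.op_type](*inputs)
        # Store the intermediate results for downstream nodes
        for name, value in zip(node.output, outputs):
            results[name] = value

    return tuple(results[out.name] for out in graph.output)
```
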
Calibration is the process of executing the NumpyModule with a representative set of data, in floating point. It computes statistics for all the intermediate tensors used in the network, which are then used to determine the quantization parameters.
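A minimal sketch of the idea, assuming we only track per-tensor min/max statistics while running the floating point interpreter on a calibration set (the real implementation records richer statistics through its quantizer objects):

```python
import numpy

def calibrate_min_max(run_float_graph, calibration_data):
    """Hypothetical sketch: collect min/max statistics for every intermediate tensor.

    `run_float_graph` is assumed to return a dict mapping tensor names to the
    floating point values computed for one batch of inputs.
    """
    stats = {}
    for batch in calibration_data:
        for name, tensor in run_float_graph(batch).items():
            lo, hi = stats.get(name, (numpy.inf, -numpy.inf))
            stats[name] = (min(lo, tensor.min()), max(hi, tensor.max()))
    # Used afterwards to derive a scale / zero-point per tensor
    return stats
```
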
Note that the NumpyModule interpreter currently supports the following ONNX operators.
Quantization is the process of converting floating point weights, inputs and activations to integers, according to the quantization parameters computed during calibration.
Initializers (the model's trained parameters) are quantized according to n_bits and passed to the Post Training Quantization (PTQ) process.
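For uniform quantization, the parameters derived from calibration are typically a scale and a zero-point, used as in the simplified sketch below (the actual Concrete-ML quantizers handle more options, such as signed ranges and symmetric weights):

```python
import numpy

def quantize(values, v_min, v_max, n_bits):
    """Standard uniform quantization sketch: floats -> integers in [0, 2**n_bits - 1]."""
    scale = (v_max - v_min) / (2**n_bits - 1)
    zero_point = numpy.round(-v_min / scale)
    q_values = numpy.clip(numpy.round(values / scale) + zero_point, 0, 2**n_bits - 1)
    return q_values.astype(numpy.int64), scale, zero_point

def dequantize(q_values, scale, zero_point):
    """Approximate reconstruction of the float values."""
    return scale * (q_values - zero_point)
```
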
During the PTQ process, the ONNX model stored in the NumpyModule is interpreted and calibrated using the ONNX_OPS_TO_QUANTIZED_IMPL dictionary, which maps ONNX operators (e.g. Gemm) to their quantized equivalent (e.g. QuantizedGemm). For more information on implementing these operations, please see the FHE compatible op-graph section.
Quantized operators are then used to create a QuantizedModule that, similarly to the NumpyModule, runs through the operators to perform the quantized inference with integer-only operations.
That QuantizedModule is then compilable to FHE if the intermediate values conform to the 8-bit precision limit of the Concrete stack.
In order to better understand how Concrete-ML works under the hood, it is possible to access each model in its ONNX format and then either print it or visualize it by importing the associated file in Netron. For example, with LogisticRegression:
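A minimal sketch, assuming the built-in models expose the imported graph through an onnx_model attribute after fitting (the attribute name may differ across Concrete-ML versions):

```python
from sklearn.datasets import load_breast_cancer
from concrete.ml.sklearn import LogisticRegression

x, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression()
model.fit(x, y)

# Print the ONNX graph, or save it to a .onnx file and open it in Netron
print(model.onnx_model)
```
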
The ONNX import section gave an overview of the conversion of a generic ONNX graph to an FHE compatible Concrete-ML op-graph. This section describes the implementation of the operations in the Concrete-ML op-graph and the way floating point can be used in some parts of the op-graphs through table lookup operations.
Concrete, the underlying implementation of TFHE that powers Concrete-ML, enables two types of operations on integers:
arithmetic operations: addition of two encrypted values and multiplication of encrypted values with clear scalars. These are used, for example, in dot-products, matrix multiplication (linear layers), and convolution
table lookup operations (TLU): using an encrypted value as an index, return the value of a lookup table at that index. This is implemented using Programmable Bootstrapping (PBS). This operation is used to perform any non-linear computation such as activation functions, quantization, normalization
Since machine learning models use floating point inputs and weights, they first need to be converted to integer using quantization.
Alternatively, it is possible to use a table lookup to avoid the quantization of the entire graph, by converting floating-point ONNX subgraphs into lambdas, and computing their corresponding lookup tables to be evaluated directly in FHE. This operator fusion technique only requires the input and output of the lambdas to be integers.
For example, in the following graph there is a single input, which must be an encrypted integer tensor. A series of univariate functions is applied to it before it is fed into a matrix multiplication (MatMul); this series is fused into a single table lookup with integer inputs and outputs.
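As a purely illustrative NumPy sketch of this pattern (the constants and function are made up), the univariate float chain below depends only on the single integer input, so it could be fused into one TLU, while the MatMul with a constant matrix stays in the integer domain:

```python
import numpy

WEIGHTS = numpy.array([[1, -2], [3, 0], [-1, 2]])  # constant (clear) matrix

def univariate_chain(q_x):
    # Dequantize, apply a non-linearity, re-quantize: all univariate float ops,
    # so the whole chain can become a single integer -> integer table lookup
    x = (q_x - 8) * 0.25
    x = 1.0 / (1.0 + numpy.exp(-x))
    return numpy.rint(15 * x).astype(numpy.int64)

def graph(q_x):
    # Integer MatMul with constant weights follows the fused univariate chain
    return univariate_chain(q_x) @ WEIGHTS

print(graph(numpy.arange(6).reshape(2, 3)))
```
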
Concrete-ML implements ONNX operations using Concrete-Numpy, which can handle floating point operations, as long as they can be fused to an integer lookup table. The ONNX operation implementations are based on the QuantizedOp class.
There are two modes of creation of a single table lookup for a chain of ONNX operations:
float mode: when the operation can be fused
mixed float/integer: when the ONNX operation needs to perform arithmetic operations
Thus, QuantizedOp instances may need to quantize their inputs or the result of their computation, depending on their position in the graph.
The QuantizedOp class provides a generic implementation of an ONNX operation, including quantization of inputs and outputs, with the computation implemented in NumPy in ops_impl.py. The architecture of QuantizedOp can be pictured as the following structure:
This figure shows that the QuantizedOp has a body that implements the computation of the operation, following the ONNX spec. The operation's body can take either integer or float inputs and can output float or integer values. Two quantizers are attached to the operation: one that takes float inputs and produces the integer inputs, and one that does the same for the output.
Depending on the position of the op in the graph and its inputs, the QuantizedOp can be fully fused to a TLU.
Many ONNX ops are trivially univariate, as they multiply variable inputs with constants, or apply univariate functions such as ReLU, Sigmoid, etc. This includes operations between the input and the MatMul in the graph above (subtraction, comparison, multiplication, etc. between inputs and constants).
Operations such as matrix multiplication of encrypted inputs with a constant matrix, or convolution with constant weights, require that the encrypted inputs be integers. In this case, the input quantizer of the QuantizedOp is applied. These types of operation are implemented with a class that derives from QuantizedOp and implements q_impl, such as QuantizedGemm and QuantizedConv.
Finally, some operations produce graph outputs which must be integers. These operations thus need to quantize their outputs as follows:
The diagram above shows that both float ops and integer ops need to quantize their outputs to integer, when placed at the end of the graph.
To chain the operation types described above, following the ONNX graph, Concrete-ML constructs a function that calls the q_impl of the QuantizedOp instances in the graph in sequence, and uses Concrete-Numpy to trace the execution and compile to FHE. Thus, in this chain of function calls, all groups of instructions that operate in floating point will be fused to table lookups (TLUs). In FHE, this lookup table is computed with a PBS.
The red contours show the groups of elementary Concrete Numpy instructions that will be converted to TLUs.
Note that the input is slightly different from the QuantizedOp. Since the encrypted function takes integers as inputs, the input needs to be dequantized first.
QuantizedOp
QuantizedOp is the base class for all ONNX quantized operators. It abstracts away many things to allow easy implementation of new quantized ops.
The QuantizedOp class exposes a can_fuse function that:
helps to determine the type of implementation that will be traced
determines whether operations further in the graph that depend on the results of this operation can fuse
In most cases, ONNX ops have a single variable input and one or more constant inputs.
When the op implements element-wise operations between the inputs and constants (addition, subtraction, multiplication, etc.), the operation can be fused to a TLU. Thus, by default in QuantizedOp, the can_fuse function returns True.
When the op implements operations that mix the various scalars in the input encrypted tensor, the operation cannot fuse, as table lookups are univariate. Thus, operations such as QuantizedGemm and QuantizedConv return False in can_fuse.
Some operations may be found in both settings above. A mechanism is implemented in Concrete-ML to determine if the inputs of a QuantizedOp are produced by a unique integer tensor. Thus, the can_fuse function of some QuantizedOp types (addition, subtraction) will allow fusion to take place if both operands are produced by a unique integer tensor:
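A sketch of what such an override can look like, assuming the op tracks the names of the integer tensors that produce its variable inputs (the attribute name below is illustrative):

```python
def can_fuse(self) -> bool:
    # Fusion to a TLU is only possible if all variable inputs are derived from
    # a single integer tensor, i.e. the operation is effectively univariate.
    return len(self._int_input_names) == 1
```
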
You can check ops_impl.py to see how some operations are implemented with NumPy. The declaration convention for these operations is as follows (a sketch is given after the list):
The required inputs should be positional arguments only, before the /, which marks the limit of the positional arguments
The optional inputs should be positional or keyword arguments, between the / and *, which marks the limit of positional or keyword arguments
The operator attributes should be keyword arguments only, after the *
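For illustration, a hypothetical operator following this convention could be declared like this (the op and argument names are made up):

```python
from typing import Optional

import numpy

def numpy_example_op(
    x: numpy.ndarray,                        # required input: positional-only (before the /)
    /,
    bias: Optional[numpy.ndarray] = None,    # optional input: positional-or-keyword (between / and *)
    *,
    axis: int = 0,                           # operator attribute: keyword-only (after the *)
) -> tuple:
    result = x if bias is None else x + bias
    return (numpy.sum(result, axis=axis),)
```
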
The proper use of positional/keyword arguments is required to allow the QuantizedOp class to properly populate metadata automatically. It uses Python's inspect module and stores relevant information for each argument related to its positional/keyword status. This allows using the Concrete-NumPy implementation as a specification for QuantizedOp, which removes some data duplication and allows having a single source of truth for QuantizedOp and the ONNX NumPy implementations.
In that case (unless the quantized implementation requires special handling, like QuantizedGemm), you can just set _impl_for_op_named to the name of the ONNX op for which the quantized class is implemented (this uses the ONNX_OPS_TO_NUMPY_IMPL mapping in onnx_utils.py to get the correct implementation).
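For example, a purely univariate op can be declared with just this class attribute; the sketch below uses Sigmoid as an illustration and assumes a NumPy implementation is registered for it (the import path is an assumption and may vary between versions):

```python
from concrete.ml.quantization import QuantizedOp  # import path is an assumption

class QuantizedSigmoid(QuantizedOp):
    """Sigmoid sketch: no q_impl override is needed, the float NumPy
    implementation is looked up by name and can be fused into a TLU."""

    _impl_for_op_named: str = "Sigmoid"
```
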
Providing an integer implementation requires sub-classing QuantizedOp to create a new operation. This sub-class must override q_impl in order to provide an integer implementation. QuantizedGemm is an example of such a case, where quantized matrix multiplication requires proper handling of scales and zero points. The q_impl of that class reflects that.
In the body of q_impl, in order to obtain quantized integer values, you can use the _prepare_inputs_with_constants function as such:
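A sketch of this pattern, modeled on how an op like QuantizedGemm handles its inputs; the exact keyword arguments are assumptions and may differ between versions:

```python
def q_impl(self, *q_inputs, **attrs):
    # Quantize the variable inputs and the constant inputs (weights, biases)
    # with the op's input quantizers
    prepared_inputs = self._prepare_inputs_with_constants(
        *q_inputs, calibrate=False, quantize_actual_values=True
    )

    # Integer values of the first (encrypted) input and of the constant weights
    q_input = prepared_inputs[0].qvalues
    q_weights = prepared_inputs[1].qvalues
    ...
```
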
Here, prepared_inputs will contain one or more QuantizedArray instances, of which the qvalues are the quantized integers.
Once the required integer processing code is implemented, the output of the q_impl function must be returned as a single QuantizedArray. Most commonly, this is built using the dequantized results of the processing done in q_impl.
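Continuing the sketch above, the integer result is typically dequantized and wrapped into a QuantizedArray before being returned; the variable names are illustrative and additional constructor arguments (options, stats, params) may be required in practice:

```python
def q_impl(self, *q_inputs, **attrs):
    # ... integer computation producing q_result, with scale / zero_point
    # taken from the input quantizers (names here are illustrative) ...
    float_result = (q_result - zero_point) * scale  # de-quantize the integer result

    # Return a single QuantizedArray, re-quantized with the op's output parameters
    return QuantizedArray(self.n_bits, float_result, value_is_float=True)
```
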
In this case, in q_impl you can check whether the current operation can be fused by calling self.can_fuse(). You can then have both a floating point and an integer implementation; the traced execution path will depend on can_fuse():
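A toy sketch of this dispatch, not tied to the actual Concrete-ML classes, just to illustrate the two execution paths:

```python
import numpy

class ToyQuantizedAbs:
    """Toy stand-in (not the real Concrete-ML class) for an op that supports
    both a fusable float path and an integer path."""

    def __init__(self, produced_by_unique_int_tensor: bool):
        self._fusable = produced_by_unique_int_tensor

    def can_fuse(self) -> bool:
        return self._fusable

    def q_impl(self, q_x: numpy.ndarray) -> numpy.ndarray:
        if self.can_fuse():
            # Float implementation: traced as floating point operations that
            # Concrete-Numpy will fuse into a single table lookup (TLU)
            return numpy.abs(q_x.astype(numpy.float64))
        # Integer implementation: stays in the quantized integer domain
        return numpy.abs(q_x)
```
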
The APIs of the project are detailed in a dedicated API reference.