
Inner Workings

Importing ONNX

Internally, Concrete ML uses ONNX operators as an intermediate representation (or IR) for manipulating machine learning models exported from PyTorch, Hummingbird, and skorch.

As ONNX is becoming the standard exchange format for neural networks, this allows Concrete ML to be flexible while also making model representation manipulation easy. In addition, it allows for a straightforward mapping to NumPy operators, which are supported by Concrete, so that the Concrete stack's FHE-conversion capabilities can be used.

Torch to NumPy conversion using ONNX

The diagram below gives an overview of the steps involved in the conversion of an ONNX graph to an FHE-compatible format (i.e., a format that can be compiled to FHE through Concrete).

All Concrete ML built-in models follow the same pattern for FHE conversion:

1. The models are trained with sklearn or PyTorch.
2. All models have a PyTorch implementation for inference. This implementation is provided either by a third-party tool such as Hummingbird or implemented directly in Concrete ML.
3. The PyTorch model is exported to ONNX. For more information on the use of ONNX in Concrete ML, see here.
4. The Concrete ML ONNX parser checks that all the operations in the ONNX graph are supported and assigns reference NumPy operations to them. This step produces a NumpyModule.
5. Quantization is performed on the NumpyModule, producing a QuantizedModule. Two steps are performed: calibration and assignment of equivalent QuantizedOp objects to each ONNX operation. The QuantizedModule class is the quantized counterpart of the NumpyModule.
6. Once the QuantizedModule is built, Concrete is used to trace the ._forward() function of the QuantizedModule.

Moreover, by passing a user-provided nn.Module to step 2 of the above process, Concrete ML supports custom user models. See the associated FHE-friendly model documentation for instructions about working with such models.

Once an ONNX model is imported, it is converted to a NumpyModule, then to a QuantizedModule and, finally, to an FHE circuit. However, as the diagram shows, it is perfectly possible to stop at the NumpyModule level if you just want to run the PyTorch model as NumPy code without doing quantization.
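The idea of assigning a reference NumPy operation to each ONNX operator can be pictured with a toy interpreter. The sketch below is an illustration only: the `TOY_ONNX_TO_NUMPY` table, the `run_toy_graph` helper, and the triple-based graph format are hypothetical and are not the Concrete ML parser or its NumpyModule.

```python
import numpy as np

# Hypothetical mapping from ONNX operator names to NumPy reference
# implementations (illustrative only, not the Concrete ML table).
TOY_ONNX_TO_NUMPY = {
    "MatMul": lambda a, b: a @ b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: np.maximum(a, 0),
}


def run_toy_graph(nodes, inputs):
    """Evaluate nodes, given in topological order, with NumPy."""
    values = dict(inputs)
    for op_type, input_names, output_name in nodes:
        # Reject graphs that contain an operator without a NumPy equivalent
        if op_type not in TOY_ONNX_TO_NUMPY:
            raise ValueError(f"Unsupported operator: {op_type}")
        args = [values[name] for name in input_names]
        values[output_name] = TOY_ONNX_TO_NUMPY[op_type](*args)
    return values


# A linear layer followed by ReLU, written as (op, inputs, output) triples
toy_nodes = [
    ("MatMul", ["x", "w"], "xw"),
    ("Add", ["xw", "b"], "lin"),
    ("Relu", ["lin"], "y"),
]
toy_values = run_toy_graph(
    toy_nodes,
    {
        "x": np.array([[1.0, -2.0]]),
        "w": np.array([[0.5, 1.0], [1.5, -1.0]]),
        "b": np.array([0.0, 1.0]),
    },
)
print(toy_values["y"])
```

Stopping at this level corresponds to running the model as plain NumPy code, without any quantization.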

Note that the NumpyModule interpreter currently supports the following ONNX operators.

Inspecting the ONNX models

To better understand how Concrete ML works under the hood, it is possible to access each model in its ONNX format and then either print it or visualize it by importing the associated file in Netron. For example, with LogisticRegression:

import onnx
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from concrete.ml.sklearn import LogisticRegression

# Create the data for classification
x, y = make_classification(n_samples=250, class_sep=2, n_features=30, random_state=42)

# Retrieve train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=42
)

# Fix the number of bits to use for quantization
model = LogisticRegression(n_bits=8)

# Fit the model
model.fit(X_train, y_train)

# Access the ONNX model
onnx_model = model.onnx_model

# Print the model
print(onnx.helper.printable_graph(onnx_model.graph))

# Save the model
onnx.save(onnx_model, "tmp.onnx")

# And then visualize it with Netron

Quantization Tools

Quantizing data

Concrete ML has support for quantized ML models and also provides quantization tools for Quantization Aware Training and Post-Training Quantization. The core of this functionality is the conversion of floating point values to integers and back. This is done using QuantizedArray in concrete.ml.quantization.

The QuantizedArray class takes several arguments that determine how float values are quantized:

n_bits defines the precision used in quantization
values are floating point values that will be converted to integers
is_signed determines if the quantized integer values should allow negative values
is_symmetric determines if the range of floating point values to be quantized should be taken as symmetric around zero

See also the UniformQuantizer reference for more information:

from concrete.ml.quantization import QuantizedArray
import numpy
numpy.random.seed(0)
A = numpy.random.uniform(-2, 2, 10)
print("A = ", A)
# array([ 0.19525402,  0.86075747,  0.4110535,  0.17953273, -0.3053808,
#         0.58357645, -0.24965115,  1.567092 ,  1.85465104, -0.46623392])
q_A = QuantizedArray(7, A)
print("q_A.qvalues = ", q_A.qvalues)
# array([ 37,          73,          48,         36,          9,
#         58,          12,          112,        127,         0])
# the quantized integers values from A.
print("q_A.quantizer.scale = ", q_A.quantizer.scale)
# 0.018274684777173276, the scale S.
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# 26, the zero point Z.
print("q_A.dequant() = ", q_A.dequant())
# array([ 0.20102153,  0.85891018,  0.40204307,  0.18274685, -0.31066964,
#         0.58478991, -0.25584559,  1.57162289,  1.84574316, -0.4751418 ])
# Dequantized values.
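The scale and zero point printed above are consistent with plain uniform (affine) quantization over the range of A. The following is a minimal NumPy sketch of that arithmetic, with A copied from the printed values; `toy_quantize` and `toy_dequantize` are hypothetical helpers and should not be read as the QuantizedArray implementation.

```python
import numpy as np

# Values copied from the array printed in the example above
A = np.array([0.19525402, 0.86075747, 0.4110535, 0.17953273, -0.3053808,
              0.58357645, -0.24965115, 1.567092, 1.85465104, -0.46623392])


def toy_quantize(values, n_bits):
    """Uniform asymmetric quantization onto [0, 2**n_bits - 1]."""
    v_min, v_max = values.min(), values.max()
    scale = (v_max - v_min) / (2**n_bits - 1)
    zero_point = int(np.round(-v_min / scale))
    q = np.round(values / scale).astype(np.int64) + zero_point
    return q, scale, zero_point


def toy_dequantize(q, scale, zero_point):
    """Map integers back to (approximate) floating point values."""
    return (q - zero_point) * scale


q, S, Z = toy_quantize(A, 7)
print(q, S, Z)
print(toy_dequantize(q, S, Z))
```

Each de-quantized value differs from the original by at most half the scale, which is the rounding error introduced by quantization.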

It is also possible to use symmetric quantization, where the integer values are centered around 0:

q_A = QuantizedArray(3, A)
print("Unsigned: q_A.qvalues = ", q_A.qvalues)
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# Unsigned: q_A.qvalues =  [2 4 2 2 0 3 0 6 7 0]
# q_A.quantizer.zero_point =  1

q_A = QuantizedArray(3, A, is_signed=True, is_symmetric=True)
print("Signed Symmetric: q_A.qvalues = ", q_A.qvalues)
print("q_A.quantizer.zero_point = ", q_A.quantizer.zero_point)
# Signed Symmetric: q_A.qvalues =  [ 0  1  1  0  0  1  0  3  3 -1]
# q_A.quantizer.zero_point =  0
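The signed symmetric output above can be reproduced with a short sketch, under the assumption that the scale is the largest absolute value divided by the largest positive integer, 2^(n_bits - 1) - 1. This formula matches the printed values but is an assumption, and `toy_quantize_symmetric` is a hypothetical helper.

```python
import numpy as np

# Values copied from the array printed in the first example
A = np.array([0.19525402, 0.86075747, 0.4110535, 0.17953273, -0.3053808,
              0.58357645, -0.24965115, 1.567092, 1.85465104, -0.46623392])


def toy_quantize_symmetric(values, n_bits):
    """Signed symmetric quantization: the zero point is fixed at 0."""
    q_max = 2 ** (n_bits - 1) - 1
    # Assumed formula: the float range is taken as [-max|v|, +max|v|]
    scale = np.abs(values).max() / q_max
    q = np.clip(np.round(values / scale), -q_max, q_max).astype(np.int64)
    return q, scale


q_sym, s_sym = toy_quantize_symmetric(A, 3)
print(q_sym)
```

With the zero point fixed at 0, a float value of 0 always maps to the integer 0, at the cost of leaving part of the integer range unused when the data is not centered.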

In the following example, showing the de-quantization of model outputs, the QuantizedArray class is used in a different way. Here it uses pre-quantized integer values and has the scale and zero-point set explicitly. Once the QuantizedArray is constructed, calling dequant() will compute the floating point values corresponding to the integer values qvalues, which are the output of the fhe_circuit.encrypt_run_decrypt(..) call.

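import numpy

from concrete.ml.quantization import QuantizedArray

# Integer values, e.g., as returned by fhe_circuit.encrypt_run_decrypt(..)
q_values = numpy.array([0, 0, 1, 2, 3, -1])

# Reuse the quantization parameters of q_A (defined above) to de-quantize
dequantized = QuantizedArray(
    q_A.quantizer.n_bits,
    q_values,
    value_is_float=False,
    options=q_A.quantizer.quant_options,
    stats=q_A.quantizer.quant_stats,
    params=q_A.quantizer.quant_params,
).dequant()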

Quantized modules

Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions, and element-wise operations. When working with quantized values, these operations cannot be carried out in an equivalent way to floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
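As a rough illustration of this re-scaling, the sketch below quantizes the inputs and weights of a small matrix multiplication, accumulates in integers, and applies a single float re-scaling at the end. The `quantize` helper and all values are illustrative assumptions, not Concrete ML code.

```python
import numpy as np


def quantize(values, n_bits):
    """Hypothetical uniform quantizer returning (integers, scale, zero point)."""
    v_min, v_max = values.min(), values.max()
    scale = (v_max - v_min) / (2**n_bits - 1)
    zero_point = int(np.round(-v_min / scale))
    q = np.round(values / scale).astype(np.int64) + zero_point
    return q, scale, zero_point


x = np.array([[0.5, -1.0, 2.0]])
w = np.array([[1.0, -0.5], [0.25, 2.0], [-1.5, 0.75]])

q_x, s_x, z_x = quantize(x, 8)
q_w, s_w, z_w = quantize(w, 8)

# Since x ~ s_x * (q_x - z_x) and w ~ s_w * (q_w - z_w), the product
# x @ w is approximated by an integer accumulation followed by a
# single multiplication with s_x * s_w.
acc = (q_x - z_x) @ (q_w - z_w)
y_quant = s_x * s_w * acc
y_float = x @ w

print(y_float, y_quant)
```

The integer accumulator `acc` is the part that maps to encrypted arithmetic; the final re-scaling is a float operation on its result.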

In Concrete ML, the quantized equivalent of a scikit-learn model or a PyTorch nn.Module is the QuantizedModule. Note that only inference is implemented in the QuantizedModule, and it is built through a conversion of the inference function of the corresponding scikit-learn or PyTorch module.

Built-in neural networks expose the quantized_module member, while a QuantizedModule is also the result of the compilation of custom models through compile_torch_model and compile_brevitas_qat_model.

The quantized versions of floating point model operations are stored in the QuantizedModule. The ONNX_OPS_TO_QUANTIZED_IMPL dictionary maps ONNX floating point operators (e.g., Gemm) to their quantized equivalent (e.g., QuantizedGemm). For more information on implementing these operations, please see the FHE-compatible op-graph section.

The computation graph is taken from the corresponding floating point ONNX graph exported from scikit-learn using Hummingbird, or from the ONNX graph exported by PyTorch. Calibration is used to obtain quantized parameters for the operations in the QuantizedModule. Parameters are also determined for the quantization of inputs during model deployment.

Calibration is the process of determining the typical distributions of values encountered for the intermediate values of a model during inference.

To perform calibration, an interpreter goes through the ONNX graph in topological order and stores the intermediate results as it goes. The statistics of these values determine quantization parameters.
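A toy version of this procedure might look as follows. It is a sketch under simplifying assumptions: the statistics kept are only the minimum and maximum of each intermediate tensor, and the `toy_calibrate` helper, the `toy_ops` table, and the graph format are hypothetical rather than taken from Concrete ML.

```python
import numpy as np


def toy_calibrate(nodes, ops, calibration_inputs, n_bits):
    """Run the float graph on calibration data and derive a
    (scale, zero_point) pair for every intermediate tensor."""
    values = dict(calibration_inputs)
    quant_params = {}
    for op_type, input_names, output_name in nodes:  # topological order
        result = ops[op_type](*(values[name] for name in input_names))
        values[output_name] = result  # store the intermediate result
        v_min, v_max = float(result.min()), float(result.max())
        # Guard against constant tensors, whose range is empty
        scale = (v_max - v_min) / (2**n_bits - 1) if v_max > v_min else 1.0
        zero_point = int(round(-v_min / scale))
        quant_params[output_name] = (scale, zero_point)
    return quant_params


toy_ops = {"Double": lambda a: 2.0 * a, "Relu": lambda a: np.maximum(a, 0.0)}
toy_nodes = [("Double", ["x"], "doubled"), ("Relu", ["doubled"], "activated")]
calibration_set = {"x": np.array([-0.5, 0.0, 0.5, 1.0, 1.5])}

toy_params = toy_calibrate(toy_nodes, toy_ops, calibration_set, n_bits=2)
print(toy_params)
```

The quality of the resulting parameters depends on how representative the calibration data is of the values seen during inference.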

The QuantizedModule generates the Concrete function that is compiled to FHE. The compilation will succeed if the intermediate values conform to the 16-bit precision limit of the Concrete stack. See the compilation section for details.

Resources

Lei Mao's blog on quantization: Quantization for Neural Networks
Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

FHE Op-graph Design

The ONNX import section gave an overview of the conversion of a generic ONNX graph to an FHE-compatible Concrete ML op-graph. This section describes the implementation of operations in the Concrete ML op-graph and the way floating point can be used in some parts of the op-graphs through table lookup operations.

Float vs. quantized operations

Concrete, the underlying implementation of TFHE that powers Concrete ML, enables two types of operations on integers:

arithmetic operations: the addition of two encrypted values and multiplication of encrypted values with clear scalars. These are used, for example, in dot-products, matrix multiplication (linear layers), and convolution.
table lookup operations (TLU): using an encrypted value as an index, return the value of a lookup table at that index. This is implemented using Programmable Bootstrapping. This operation is used to perform any non-linear computation such as activation functions, quantization, and normalization.

Since machine learning models use floating point inputs and weights, they first need to be converted to integers using quantization.

Alternatively, it is possible to use a table lookup to avoid the quantization of the entire graph, by converting floating-point ONNX subgraphs into lambdas and computing their corresponding lookup tables to be evaluated directly in FHE. This operator-fusion technique only requires the input and output of the lambdas to be integers.
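To make the fusion idea concrete, the sketch below tabulates a float function (de-quantize, apply sigmoid, re-quantize) for every possible n-bit integer input, so that evaluating it reduces to indexing a table. The quantization parameters and the `build_table` helper are hypothetical, and plain NumPy indexing on clear integers stands in for the encrypted table lookup.

```python
import numpy as np


def build_table(float_fn, n_bits, in_scale, in_zp, out_scale, out_zp):
    """Tabulate dequantize -> float_fn -> quantize for every n-bit input."""
    q_in = np.arange(2**n_bits)
    x = (q_in - in_zp) * in_scale          # integers to floats
    y = float_fn(x)                        # arbitrary float computation
    q_out = np.round(y / out_scale).astype(np.int64) + out_zp
    return np.clip(q_out, 0, 2**n_bits - 1)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


n_bits = 3
table = build_table(
    sigmoid, n_bits, in_scale=1.0, in_zp=4, out_scale=1 / 7, out_zp=0
)

# Applying the fused function is a lookup with an integer index
q_input = np.array([0, 4, 7])
q_output = table[q_input]
print(table, q_output)
```

Only the integer inputs and outputs of the table are visible from outside, which is why the float computation in between does not need to be quantized.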

For example, in the following graph there is a single input, which must be an encrypted integer tensor. The following series of univariate functions is then fed into a matrix multiplication (MatMul) and fused into a single table lookup with integer inputs and outputs.

ONNX operations

Concrete ML implements ONNX operations using Concrete, which can handle floating point operations as long as they can be fused to an integer lookup table. The ONNX operation implementations are based on the QuantizedOp class.

There are two modes of creation of a single table lookup for a chain of ONNX operations:

float mode: when the operation can be fused
mixed float/integer: when the ONNX operation needs to perform arithmetic operations

Thus, QuantizedOp instances may need to quantize their inputs or the result of their computation, depending on their position in the graph.

The QuantizedOp class provides a generic implementation of an ONNX operation, including the quantization of inputs and outputs, with the computation implemented in NumPy in ops_impl.py. It is possible to picture the architecture of the QuantizedOp as the following structure:

This figure shows that the QuantizedOp has a body that implements the computation of the operation, following the ONNX spec. The operation's body can take either integer or float inputs and can output float or integer values. Two quantizers are attached to the operation: one that takes float inputs and produces integer inputs and one that does the same for the output.

Operations that can fuse to a TLU

Depending on the position of the op in the graph and its inputs, the QuantizedOp can be fully fused to a TLU.

Many ONNX ops are trivially univariate, as they multiply variable inputs with constants or apply univariate functions such as ReLU, Sigmoid, etc. This includes operations between the input and the MatMul in the graph above (subtraction, comparison, multiplication, etc. between inputs and constants).

Operations that work on integers

Operations, such as matrix multiplication of encrypted inputs with a constant matrix or convolution with constant weights, require that the encrypted inputs be integers. In this case, the input quantizer of the QuantizedOp is applied. These types of operations are implemented with a class that derives from QuantizedOp and implements q_impl, such as QuantizedGemm and QuantizedConv.

Operations that produce graph outputs

Finally, some operations produce graph outputs, which must be integers. These operations need to quantize their outputs as follows:

The diagram above shows that both float ops and integer ops need to quantize their outputs to integers when placed at the end of the graph.

Putting it all together

To chain the operation types described above following the ONNX graph, Concrete ML constructs a function that calls the q_impl of the QuantizedOp instances in the graph in sequence, and uses Concrete to trace the execution and compile to FHE. Thus, in this chain of function calls, all groups of instructions that operate in floating point will be fused to TLUs. In FHE, each such lookup table is computed with a PBS.

The red contours show the groups of elementary Concrete instructions that will be converted to TLUs.

Note that the input is slightly different from the QuantizedOp. Since the encrypted function takes integers as inputs, the input needs to be de-quantized first.

Implementing a `QuantizedOp`

QuantizedOp is the base class for all ONNX-quantized operators. It abstracts away many things to allow easy implementation of new quantized ops.

Determining if the operation can be fused

The QuantizedOp class exposes a function can_fuse that:

helps to determine the type of implementation that will be traced.
determines whether operations further in the graph, that depend on the results of this operation, can fuse.

In most cases, ONNX ops have a single variable input and one or more constant inputs.

When the op implements element-wise operations between the inputs and constants (addition, subtraction, multiplication, etc.), the operation can be fused to a TLU. Thus, by default in QuantizedOp, the can_fuse function returns True.

When the op implements operations that mix the various scalars in the input encrypted tensor, the operation cannot fuse, as table lookups are univariate. Thus, operations such as QuantizedGemm and QuantizedConv return False in can_fuse.

Some operations may be found in both settings above. A mechanism is implemented in Concrete ML to determine if the inputs of a QuantizedOp are produced by a unique integer tensor. Therefore, the can_fuse function of some QuantizedOp types (addition, subtraction) will allow fusion to take place if both operands are produced by a unique integer tensor:

def can_fuse(self) -> bool:
    return len(self._int_input_names) == 1
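One way to picture the "unique integer tensor" condition is to follow every input of an op back to the integer tensors it depends on. The sketch below is a toy model of that idea and not the Concrete ML mechanism: the graph format, the `NON_FUSABLE` set, and the helper names are all hypothetical.

```python
# Hypothetical set of ops that mix several encrypted scalars and therefore
# produce a new integer tensor instead of fusing
NON_FUSABLE = {"Gemm", "Conv"}


def int_sources(name, nodes, encrypted_inputs):
    """Return the integer tensors that `name` depends on."""
    if name in encrypted_inputs:
        return {name}
    if name not in nodes:  # a constant has no encrypted source
        return set()
    op_type, input_names = nodes[name]
    if op_type in NON_FUSABLE:
        return {name}  # its output is a fresh integer tensor
    sources = set()
    for input_name in input_names:
        sources |= int_sources(input_name, nodes, encrypted_inputs)
    return sources


def toy_can_fuse(name, nodes, encrypted_inputs):
    """An op can fuse when all its inputs come from one integer tensor."""
    _, input_names = nodes[name]
    sources = set()
    for input_name in input_names:
        sources |= int_sources(input_name, nodes, encrypted_inputs)
    return len(sources) == 1


# Each entry maps an output name to (op type, input names)
toy_graph = {
    "a": ("Relu", ["x"]),
    "b": ("Mul", ["x", "c"]),      # "c" is a constant
    "sum1": ("Add", ["a", "b"]),   # both operands derive from "x"
    "g": ("Gemm", ["x", "w"]),     # "w" is a constant
    "sum2": ("Add", ["a", "g"]),   # operands derive from "x" and "g"
}
print(toy_can_fuse("sum1", toy_graph, {"x"}))
print(toy_can_fuse("sum2", toy_graph, {"x"}))
```

In this toy graph, `sum1` is a univariate function of `x` and could be tabulated, while `sum2` combines two distinct integer tensors and could not.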

Case 1: A floating point version of the op is sufficient

You can check ops_impl.py to see how some operations are implemented in NumPy. The declaration convention for these operations is as follows:

The required inputs should be positional arguments only before the /, which marks the limit of the positional arguments.
The optional inputs should be positional or keyword arguments between the / and *, which marks the limits of positional or keyword arguments.
The operator attributes should be keyword arguments only after the *.

The proper use of positional/keyword arguments is required to allow the QuantizedOp class to properly populate metadata automatically. It uses the Python inspect module and stores relevant information for each argument related to its positional/keyword status. This allows using the Concrete implementation as the specification for QuantizedOp, which removes some data duplication and provides a single source of truth for QuantizedOp and ONNX-NumPy implementations.
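The declaration convention can be illustrated with the standard inspect module. In this sketch, `toy_gemm` is a made-up function that follows the convention, and `classify_parameters` is a hypothetical helper showing how the three groups of arguments can be recovered from a signature; neither comes from ops_impl.py.

```python
import inspect


def toy_gemm(a, b, /, c=None, *, alpha=1.0, beta=1.0):
    """Made-up op: required inputs, an optional input, then attributes."""
    result = alpha * a * b
    if c is not None:
        result = result + beta * c
    return result


def classify_parameters(func):
    """Group parameters by their positional/keyword status."""
    groups = {"required": [], "optional": [], "attributes": []}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind is inspect.Parameter.POSITIONAL_ONLY:
            groups["required"].append(name)      # before the /
        elif param.kind is inspect.Parameter.POSITIONAL_OR_KEYWORD:
            groups["optional"].append(name)      # between / and *
        elif param.kind is inspect.Parameter.KEYWORD_ONLY:
            groups["attributes"].append(name)    # after the *
    return groups


print(classify_parameters(toy_gemm))
```

Because the grouping is read from the signature itself, the function declaration is the only place where inputs and attributes need to be listed.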

In that case (unless the quantized implementation requires special handling like QuantizedGemm), you can just set _impl_for_op_named to the name of the ONNX op for which the quantized class is implemented (this uses the mapping ONNX_OPS_TO_NUMPY_IMPL in onnx_utils.py to get the correct implementation).

Case 2: An integer implementation of the op is necessary

Providing an integer implementation requires sub-classing QuantizedOp to create a new operation. This sub-class must override q_impl in order to provide an integer implementation. QuantizedGemm is an example of such a case where quantized matrix multiplication requires proper handling of scales and zero points. The q_impl of that class reflects this.

In the body of q_impl, you can use the _prepare_inputs_with_constants function in order to obtain quantized integer values:

from concrete.ml.quantization import QuantizedArray

def q_impl(
    self,
    *q_inputs: QuantizedArray,
    **attrs,
) -> QuantizedArray:

    # Retrieve the quantized inputs
    prepared_inputs = self._prepare_inputs_with_constants(
        *q_inputs, calibrate=False, quantize_actual_values=True
    )

Here, prepared_inputs will contain one or more QuantizedArray, of which the qvalues are the quantized integers.

Once the required integer processing code is implemented, the q_impl function must return its output as a single QuantizedArray. Most commonly, this is built using the de-quantized results of the processing done in q_impl.

    result = (
        sum_result.astype(numpy.float32) - q_input.quantizer.zero_point
    ) * q_input.quantizer.scale

    return QuantizedArray(
        self.n_bits,
        result,
        value_is_float=True,
        options=self.input_quant_opts,
        stats=self.output_quant_stats,
        params=self.output_quant_params,
    )

Case 3: Both a floating point and an integer implementation are necessary

In this case, in q_impl you can check whether the current operation can be fused by calling self.can_fuse(). You can then have both a floating-point and an integer implementation. The traced execution path will depend on can_fuse():


def q_impl(
    self,
    *q_inputs: QuantizedArray,
    **attrs,
) -> QuantizedArray:

    execute_in_float = len(self.constant_inputs) > 0 or self.can_fuse()

    # a floating point implementation that can fuse
    if execute_in_float:
        prepared_inputs = self._prepare_inputs_with_constants(
            *q_inputs, calibrate=False, quantize_actual_values=False
        )

        result = prepared_inputs[0] + self.b_sign * prepared_inputs[1]
        return QuantizedArray(
            self.n_bits,
            result,
            # ......
        )
    else:
        prepared_inputs = self._prepare_inputs_with_constants(
            *q_inputs, calibrate=False, quantize_actual_values=True
        )
        # an integer implementation follows, see Case 2
        # ....

External Libraries

Hummingbird

Hummingbird is a third-party, open-source library that converts machine learning models into tensor computations, and it can export these models to ONNX. The list of supported models can be found in the Hummingbird documentation.

Concrete ML allows the conversion of an ONNX inference to NumPy inference (note that NumPy is always the entry point to run models in FHE with Concrete ML).

Hummingbird exposes a convert function that can be imported as follows from the hummingbird.ml package:

# Disable Hummingbird warnings for pytest.
import warnings
warnings.filterwarnings("ignore")
from hummingbird.ml import convert

This function can be used to convert a machine learning model to ONNX as follows:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Instantiate the logistic regression from sklearn
model = LogisticRegression()

# Create synthetic data
X, y = make_classification(
    n_samples=100, n_features=20, n_classes=2
)

# Fit the model
model.fit(X, y)

# Convert the model to ONNX
onnx_model = convert(model, backend="onnx", test_input=X).model

In theory, the resulting onnx_model could be passed directly to Concrete ML's get_equivalent_numpy_forward method to obtain the NumPy inference, as long as all operators present in the ONNX model are implemented in NumPy.

In practice, there are some steps needed to clean the ONNX output and make the graph compatible with Concrete ML, such as applying quantization where needed or deleting/replacing non-FHE friendly ONNX operators (such as Softmax and ArgMax).

skorch

Concrete ML uses skorch to implement multi-layer, fully-connected PyTorch neural networks in a way that is compatible with the scikit-learn API.

This wrapper implements Torch training boilerplate code, lessening the work required of the user. It is possible to add hooks during the training phase, for example once an epoch is finished.

skorch allows the user to easily create a classifier or regressor around a neural network (NN), implemented in Torch as a nn.Module, which is used by Concrete ML to provide a fully-connected, multi-layer NN with a configurable number of layers and optional pruning (see pruning and the neural network documentation for more information).

Under the hood, Concrete ML uses a skorch wrapper around a single PyTorch module, SparseQuantNeuralNetwork. More information can be found in the API guide.

import torch.nn as nn

class SparseQuantNeuralNetImpl(nn.Module):
    """Sparse Quantized Neural Network classifier."""

Brevitas

Brevitas is a quantization aware learning toolkit built on top of PyTorch. It provides quantization layers that are one-to-one equivalents to PyTorch layers, but also contain operations that perform the quantization during training.

While Brevitas provides many types of quantization, for Concrete ML, a custom "mixed integer" quantization applies. This "mixed integer" quantization is much simpler than the "integer only" mode of Brevitas. The "mixed integer" network design is defined as:

all weights and activations of convolutional, linear and pooling layers must be quantized (e.g., using Brevitas layers, QuantConv2d, QuantAvgPool2d, QuantLinear)
PyTorch floating-point versions of univariate functions can be used (e.g., torch.relu, nn.BatchNorm2d, torch.max (encrypted vs. constant), torch.add, torch.exp). See the PyTorch supported layers page for a full list.

The "mixed integer" mode used in Concrete ML neural networks is based on the "integer only" Brevitas quantization that makes both weights and activations representable as integers during training. However, through the use of lookup tables in Concrete ML, floating point univariate PyTorch functions are supported.

For "mixed integer" quantization to work, the first layer of a Brevitas nn.Module must be a QuantIdentity layer. However, you can then use functions such as torch.sigmoid on the result of such a quantizing operation.

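import brevitas.nn as qnn
import torch
import torch.nn as nn

class QATnetwork(nn.Module):
    def __init__(self):
        super(QATnetwork, self).__init__()
        # The first layer quantizes the network inputs
        self.quant_inp = qnn.QuantIdentity(
            bit_width=4, return_quant_tensor=True)
        # ...

    def forward(self, x):
        out = self.quant_inp(x)
        # A floating point univariate function can follow the quantizer
        return torch.sigmoid(out)
        # ...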

For examples of such a "mixed integer" network design, please see the Quantization Aware Training examples.

You can also refer to the SparseQuantNeuralNetImpl class, which is the basis of the built-in NeuralNetworkClassifier.
