Concrete ML can be run on Linux-based OSes as well as macOS, on x86 CPUs. These hardware requirements are dictated by Concrete-Lib.
Do note that, since WSL on Windows is a Linux-based OS, Concrete ML will work as long as the package is not installed in the /mnt/c/ directory, which corresponds to the host OS filesystem.
To install Concrete-ML from PyPi, run the following:
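For example (standard installation from PyPI):

```bash
pip install concrete-ml
```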
Note that concrete-ml installs concrete-numpy with all extras, including pygraphviz to draw graphs. pygraphviz requires the graphviz packages to be installed on your OS; these are binary packages that won't automatically be installed by pip. See https://pygraphviz.github.io/documentation/stable/install.html for instructions on how to install graphviz for pygraphviz.
You can also get the concrete-ml Docker image by either pulling the latest Docker image or a specific version:
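For example (the image name and tags below are assumptions; check Docker Hub for the actual published image):

```bash
docker pull zamafhe/concrete-ml:latest
# or a specific version, e.g.
docker pull zamafhe/concrete-ml:v0.2.0
```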
The image can be used with Docker volumes, see the Docker documentation here.
You can then use this image with the following command:
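An illustrative command is shown below; adjust the volume path, the port and the image name/tag (assumed here) to your setup:

```bash
docker run --rm -it -p 8888:8888 -v "$(pwd)":/data zamafhe/concrete-ml:latest
```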
This will launch a Concrete-ML enabled Jupyter server in Docker, that you can access from your browser.
Alternatively, you can just open a shell in Docker with or without volumes:
This version of Concrete-ML is a first version of the product, meaning that it is not completely finished, contains several bugs (both known and unknown at this time), and will improve over time with feedback from early users.
Here are some ways to debug your problems. If nothing seems conclusive, you can still report the issue, as explained in a later section of this page.
First, we encourage the user to have a look at:
the error message received
the documentation of the product
the known limits of the product
Once you have determined that the bug was not your own, it is time to go further.
A bug may happen if ever the inputset, which is internally used by the compilation core to set bit widths of some intermediate data, is not sufficiently representative. With all the inputs in the inputset, it appears that intermediate data can be represented as an n
-bit integer. But, for a particular computation, this same intermediate data needs additional bits to be represented. The FHE execution for this computation will result in an incorrect output (as typically occurs in integer overflows in classical programs).
So, in general, when a bug appears, it may be a good idea to enlarge the inputset and try to have random-looking inputs in it, following the distribution of inputs used with the function.
Once you're sure it is a bug, it would be nice to:
make it highly reproducible by reducing as much of the randomness as possible - If you can find an input which fails, there is no reason to let the input remain random
reduce it to the smallest possible bug - It is easier to investigate bugs which are small, so when you have an issue, please try to reduce it to a smaller issue, notably with fewer lines of code, smaller parameters, less complex functions to compile, and faster scripts, etc.
You can directly ask the developers and community about your issue on our Discourse server (link on the right of the top menu).
Hopefully, it is just a misunderstanding or a small mistake on your side that we can help you fix easily. Additionally, your feedback helps us make the documentation even clearer (by adding to the FAQ, for example).
To simplify our work and let us reproduce your bug easily, we need all the information we can get. So, in addition to your Python script, the following information is useful:
the reproducibility rate you see on your side
any insight you might have on the bug
any workaround you have been able to find
Remember, Concrete-ML is a project where we are open to contributions. You can find more information at Contributing.
In case you have a reproducible bug that you have reduced to a small piece of code, we have our issue tracker in place (link on the right of the top menu). Remember that a well-described short issue is an issue that is more likely to be studied and fixed. The more issues we receive, the better the product will be.
With the current version of the framework, we cannot represent encrypted integers with more than 7 bits. While we are working on supporting larger integers, currently, whenever a floating point model needs to be processed in FHE, quantization is necessary.
In this situation, you will get a compilation error. Here is an example:
When you compile this example, it results in:
Notice that the maximum bit width, determined by the compiler, depends on the inputset passed to the compile_on_inputset
function. In this case, the error is caused by the input value in the inputset that produces a result whose representation requires 9 bits. This input is the value 8, since 8 * 42 = 336, which is a 9-bit value.
You can determine the number of bits necessary to represent a positive integer value $x$ with the formula: $n_{\text{bits}}(x) = \lfloor \log_2(x) \rfloor + 1$.
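For instance, the following pure-Python check (an illustrative snippet, not part of the Concrete-ML API) confirms that the inputset above leads to a 9-bit intermediate value:

```python
# Inputset values 0..8; the largest intermediate result is 8 * 42 = 336
max_result = max(x * 42 for x in range(9))
print(max_result.bit_length())  # 9 bits, which exceeds the 7-bit limit
```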
For a more practical example, the MNIST classification task consists of taking an image, a 28x28 array containing uint8 values representing a handwritten digit, and predicting whether it belongs to one of 10 classes: the digits from 0 to 9. The output is a one-hot vector which indicates the class to which a particular sample belongs.
The input contains 28x28x8 bits, so 6272 bits of information. In practice, you could still obtain good results on MNIST by thresholding the pixels to {0, 1} and training a model for this new binarized MNIST task. This means that in a real use case where you actually need to perform digit recognition, you could binarize your input on the fly, replacing each pixel with either 0 or 1. In doing so, you use 1 bit per pixel and now only have 784 bits of input data. It also means that if you are doing some accumulation (adding pixel values together), you are going to need accumulators that are smaller (adding 0s and 1s requires less space than adding values ranging from 0 to 255). An example of MNIST classification with a quantized neural network is given in the CNN advanced example.
This shows how adapting your data or model parameters can allow you to use models that may require smaller data types (i.e. use less precision) to perform their computations.
Binarization is an extreme case of quantization which is introduced here. You can also find further resources on the linked page.
While applying quantization directly to the input features is mandatory to reduce the effective bit width of computations, a different and complementary approach is dimensionality reduction. This can be accomplished through Principal Component Analysis (PCA) as shown in the Poisson Regression example
Quantization and dimensionality reduction reduce the bit width required to run the model and increase execution speed. These two tools are necessary to make models compatible with FHE constraints.
However, quantization and, especially, binarization, induce a loss in the accuracy of the model since its representation power is diminished. Carefully choosing a quantization approach for model parameters can alleviate accuracy loss, all the while allowing compilation to FHE.
The quantization of model parameters and model inputs is illustrated in the advanced examples for Linear Regression and for Logistic Regression. Note that different quantization parameters are used for inputs and for model weights.
Recent quantization literature usually aims to make use of dedicated machine learning accelerators in a mixed setting where a CPU or General Purpose GPU (GPGPU) is also available. Thus, in literature, some floating point computation is assumed to be acceptable. This approach allows us to reach performance similar to those achieved by floating point models. In this popular mixed float-int setting, the input is usually left in floating point. This is also true for the first and last layers, which have more impact on the resulting model accuracy than hidden layers.
However, in Concrete-ML, to respect FHE constraints, the inputs, the weights and the accumulator must all be represented with integers of a maximum of 7 bits.
Thus, in Concrete-ML, we also quantize the input data and network output activations in the same way as the rest of the network: everything is quantized to a specific number of bits. It turns out that the number of bits used for the input or the output of any activation function is crucial to comply with the constraint on accumulator width.
The core operations in neural networks are matrix multiplications (matmul) and convolutions, which both compute linear combinations of inputs (encrypted) and weights (in clear). The linear combination operation must be done such that the maximum value of its result requires at most 7 bits of precision.
Currently, Concrete-ML computes the number of bits needed for the computation depending on the inputset calibration data and does not allow the overflow (see Integer overflow) to happen, raising an exception as shown above.
The following table summarizes the various examples in this section, along with their accuracies.
| Model | Dataset | Metric | Clear | Quantized | FHE |
| --- | --- | --- | --- | --- | --- |
A * means that FHE accuracy was calculated on a subset of the validation set.
In this table, ** means that the accuracy is actually random-like, because the quantization we need to set to fulfill the bit-width constraints is too strong.
In neural networks, a neuron computes a linear combination of inputs and learned weights, then applies an activation function.
The neuron computes:
When building a full neural network, each layer will contain multiple neurons, which are connected to the neuron outputs of a previous layer or to the inputs.
For every neuron shown in each layer of the figure above, the linear combinations of inputs and learned weights are computed. Depending on the values of the inputs and weights, the sum - which, for Concrete-ML neural networks, is computed with integers - can take a range of different values.
To respect the bit width constraint of the mechanism, implemented with programmable bootstrapping, the values of the accumulator must remain small to be representable with only 7 bits. In other words, the values must be between 0 and 127.
Pruning a neural network entails fixing some of the weights to be zero during training. This is advantageous to meet FHE constraints, as, irrespective of the distribution of the input values, multiplying these inputs by 0 does not increase the accumulator value.
Fixing some of the weights to 0 makes the network graph look more similar to the following:
Pruning weights can reduce the prediction performance of the neural network, but studies show that a high level of pruning (above 50%, see Han, Song & Pool, Jeff & Tran, John & Dally, William. (2015). Learning both Weights and Connections for Efficient Neural Networks) can be applied. In Concrete-ML, we implement fully-connected neural networks with pruning, as described below.
Concrete-ML is compatible with sklearn APIs such as Pipeline() or GridSearch(), which are popular model selection methods.
Here is a simple example of such a process:
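A minimal sketch of such a pipeline is shown below. The concrete.ml.sklearn import path and the hyper-parameters of the Concrete-ML estimator are assumptions; check the API reference for the exact names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed import path for the Concrete-ML scikit-learn-style estimator
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# A standard sklearn pipeline wrapping the quantized Concrete-ML model
pipeline = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Grid-search over the regularization strength (parameter name mirrors sklearn)
param_grid = {"model__C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```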
Concrete-ML is an open-source, private machine learning inference framework based on fully homomorphic encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from scikit-learn and PyTorch.
Fully Homomorphic Encryption (FHE) is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in the additional resources listed below, or by joining the community.
This example shows the typical flow of a Concrete-ML model:
The model is trained on unencrypted (plaintext) data
The resulting model is quantized to small integers using either post-training quantization or quantization-aware training
The quantized model is compiled to a FHE equivalent (under the hood, the model is first converted to a Concrete-Numpy program, then compiled)
Inference can then be done on encrypted data
To make a model work with FHE, the only constraint is to make it run within the supported precision limitations of Concrete-ML (currently 7-bit integers).
Currently, Concrete only supports 7-bit encrypted integer arithmetic. This requires models to be quantized heavily, which sometimes leads to a loss of accuracy compared to the plaintext model. Furthermore, the Concrete-Compiler is still a work in progress, meaning it won't always find optimal performance parameters, leading to slower than expected execution times.
Additionally, Concrete-ML currently only supports FHE inference. Training on the other hand has to be done on unencrypted data, producing a model which is then converted to an FHE equivalent that can do encrypted inference.
Finally, there is currently no support for pre- and post-processing in FHE. Data must arrive at the FHE model already pre-processed, and any post-processing has to be done client-side.
All of these issues are currently being addressed and significant improvements are expected to be released in the coming months.
The interested reader has even more resources to review, in addition to this documentation:
Our community support channels, the links for which can be found at the top right of the doc pages.
The varied blog posts we publish, currently located on the Zama blog. Notably, one post describes the use of a Poisson regressor to tackle a real-life use case in a privacy-preserving setting.
Additionally, we plan to publish academic and white papers explaining interesting aspects of our work, covering both the engineering and scientific sides of our offering.
While floating point values have 32 bits of precision, machine learning datasets have features that use only a limited range of values. For example, if a feature takes a value that is limited to the range [1, 2), in floating point this value is represented as $2^0 \cdot m$, where $m$ is a number between 1 and 2. Generic floating point representation can support exponents between -126 and 127, allocating 8 bits to store the exponent. In our case, a single exponent value of 0 is needed. Knowing that, for our range, the exponent can only take a single value out of the possible ones, we can save the 8 bits allocated to the exponent, reducing the necessary bit width. We refer the reader to the IEEE 754 standard for more information on floating point representation and to this simulator that helps to understand the topic through practice.
For example, if you quantize your input and weights with $n_{\text{bits}}^{\text{inputs}}$ and $n_{\text{bits}}^{\text{weights}}$ bits of precision, one can compute the maximum dimensionality of the input and weights before the matmul/convolution result could exceed the 7 bits as such:

$$\Omega = \left\lfloor \frac{2^{n_{\max}} - 1}{(2^{n_{\text{bits}}^{\text{inputs}}} - 1)(2^{n_{\text{bits}}^{\text{weights}}} - 1)} \right\rfloor$$

where $n_{\max} = 7$ is the maximum precision allowed. For example, if we set $n_{\text{bits}}^{\text{inputs}} = 2$ and $n_{\text{bits}}^{\text{weights}} = 2$ with $n_{\max} = 7$, then $\Omega = 14$ different inputs/weights are allowed in the linear combination.
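As a quick sanity check, the bound above can be evaluated directly in Python (an illustrative snippet using the example values, not part of the Concrete-ML API):

```python
# 7-bit accumulator limit, 2-bit inputs and 2-bit weights (example values)
n_max, n_bits_inputs, n_bits_weights = 7, 2, 2
omega = (2**n_max - 1) // ((2**n_bits_inputs - 1) * (2**n_bits_weights - 1))
print(omega)  # 14
```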
Exceeding $\Omega$ dimensions in the input and weights quickly increases the risk of overflow. It may happen that, for some distributions of weights and values, the computation does not overflow, but the risk increases rapidly with the number of dimensions.
| Model | Dataset | Metric | Clear | Quantized | FHE |
| --- | --- | --- | --- | --- | --- |
Here is a simple example of encrypted inference using logistic regression. More examples can be found in the documentation.
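A minimal sketch of such an example is shown below. The concrete.ml.sklearn import path, the n_bits parameter, the compile() method and the execute_in_fhe argument are assumptions about the Concrete-ML API and may differ slightly between versions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Assumed import path for the Concrete-ML estimator
from concrete.ml.sklearn import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a quantized logistic regression (n_bits is an assumed parameter name)
model = LogisticRegression(n_bits=2)
model.fit(X_train, y_train)

# Compile to an FHE circuit, using the training data as the inputset
model.compile(X_train)

# Run encrypted inference
y_pred_fhe = model.predict(X_test, execute_in_fhe=True)
```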
Concrete-ML is built on top of Zama’s Concrete framework. It uses Concrete-Numpy, which itself uses the Concrete-Compiler and the Concrete-Library. To use these libraries directly, refer to the Concrete-Numpy and Concrete-Framework documentations.
The Virtual Lib in Concrete-ML is a prototype that provides drop-in replacements for Concrete-Numpy, Compiler and Circuit that allow users to simulate what would happen when converting a model to FHE without the current bit width constraint, or to more quickly simulate the behavior with 7 bits or less as there are no FHE computations.
In other words, you can use the compile functions from the Concrete-ML package by passing use_virtual_lib = True
and using a CompilationConfiguration
with enable_unsafe_features = True
. You will then get a simulated circuit that allows you to use more than the current 7 bits of precision allowed by the Concrete stack. It is also a faster way to measure the potential FHE accuracy with 7 bits or less. It is something we used for the red/blue contours in the Classifier Comparison notebook, as computing in FHE for the whole grid and all the classifiers would be very long.
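A sketch of this usage is shown below, reusing the model and X_train names from the logistic regression sketch above; the import location of CompilationConfiguration and the exact keyword names of the compile call are assumptions, so check the API reference for your version.

```python
from concrete.numpy import CompilationConfiguration

# enable_unsafe_features is required to use the Virtual Lib
cfg = CompilationConfiguration(enable_unsafe_features=True)

# Keyword names below are assumptions about the compile API
model.compile(
    X_train,
    compilation_configuration=cfg,
    use_virtual_lib=True,
)
```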
The Virtual Lib can be useful when developing and iterating on an ML model implementation. For example, you can check that your model is compatible in terms of operands (all integers) with the Virtual Lib compilation. Then, you can check how many bits your ML model would require, which can give you hints as to how it should be modified if you want to compile it to an actual FHE Circuit (not a simulated one) that only supports 7 bits of integer precision.
The Virtual Lib, being pure Python and not requiring crypto key generation, can be much faster than the actual compilation and FHE execution, thus allowing for faster iterations, debugging and FHE simulation, regardless of the bit width used.
Before settling on a final release, we go through a release candidate (RC) cycle. The idea is that once the codebase and documentation look ready for a release, you create an RC release by opening an issue with the release template here, starting with version `vX.Y.Zrc1` and then with versions `vX.Y.Zrc2`, `vX.Y.Zrc3`, and so on.
Once the last RC is deemed ready, open an issue with the release template using the last RC version, from which you remove the `rc?` part (i.e. `v12.67.19` if your last RC version was `v12.67.19-rc4`) on GitHub.
There are two ways to contribute to Concrete-ML or to Concrete tools in general:
You can open issues to report bugs and typos and to suggest ideas.
You can ask to become an official contributor by emailing hello@zama.ai. Only approved contributors can send pull requests (PR), so please make sure to get in touch before you do!
Let's go over some other important things that you need to be careful about.
We are using a consistent branch naming scheme, and you are expected to follow it as well. Here is the format, along with some examples:
e.g.
Each commit to Concrete-ML should conform to the standards decided by the team.
You can let the development tools fix some issues automatically with the following command:
Conformance can be checked using the following command:
Of course, tests must pass as well.
The last requirement is to make sure you get 100 percent code coverage. The make pytest
command checks that by default and will fail with a coverage report at the end should some lines of your code not be executed during testing.
If your coverage is below 100 percent, you should write more tests and then create the pull request (PR). If you ignore this warning and create the PR, GitHub actions will fail and your PR will not be merged.
There may be cases where covering your code is not possible (an exception that cannot be triggered in normal execution circumstances). In those cases, you may be allowed to disable coverage for some specific lines. This should be the exception rather than the rule, and reviewers will ask why some lines are not covered. If it appears they can be covered, then the PR won't be accepted in that state.
We are using a consistent commit naming scheme, and you are expected to follow it as well (the CI will make sure you do). The accepted format can be printed to your terminal by running:
e.g.
To learn more about conventional commits, check this page. Just a reminder that commit messages are checked in the conformance step and are rejected if they don't follow the rules.
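For illustration, commit messages following the conventional commits format look like the following (hypothetical examples):

```
feat: add a quantized Softplus operator
fix: handle zero point overflow in QuantizedGemm
docs: clarify the 7-bit accumulator constraint
chore: bump the concrete-numpy dependency
```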
We remind you that only official contributors can send pull requests. To become an official contributor, please email hello@zama.ai.
You should rebase on top of the main
branch before you create your pull request. We don't allow merge commits, so rebasing on main
before pushing gives you the best chance of avoiding having to rewrite parts of your PR later if some conflicts arise with other PRs being merged. After you commit your changes to your new branch, you can use the following commands to rebase:
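A typical sequence uses standard git commands (adapt the remote name to your setup):

```bash
git fetch origin
git rebase origin/main
# resolve any conflicts, then push your branch (force-with-lease is needed after a rebase)
git push --force-with-lease
```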
You can learn more about rebasing here.
One can simply create docs with Sphinx and open them, by doing:
The documentation contains both files written by hand by developers (the .md files) and files automatically created by parsing the source files.
Or simply open docs/_build/html/index.html
.
Note that there is a make target that conveniently builds and opens the docs at the end.
Before you start this section, go ahead and install Docker. You can follow this official guide if you require assistance.
X forwarding means redirecting the display to your host machine screen so that the Docker container can display things on your screen (otherwise you would only get CLI/terminal interface to your container).
To be able to use X forwarding on macOS:
Install XQuartz
Open XQuartz.app and make sure that authorize network connections
is set in the application parameters (currently in the Security settings)
Open a new terminal within XQuartz.app and type:
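A commonly used command to allow local connections to the X server is shown below (adjust it to your own security requirements):

```bash
xhost + 127.0.0.1
```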
Now, the X server should be all set in Docker (in the regular terminal).
Install Xming and use Xlaunch:
Multiple Windows, Display number: 0
Start no client
IMPORTANT: Check No Access Control
You can save this configuration to relaunch easily, then click finish.
Once you have access to this repository and the dev environment is installed on your host OS (via make setup_env
once you followed the steps here), you should be able to launch the commands to build the dev Docker image with make docker_build
.
Once you do that, you can get inside the Docker environment using the following command:
After you finish your work, you can leave Docker by using the exit
command or by pressing CTRL + D
.
You will need to first install Python. On Linux, this can be done automatically, along with the rest of the dependencies, by running the script indicated below with the --linux-install-python flag. If you want to install some of the dependencies manually, we detail the installation of Poetry and Make below.
On Linux and macOS you will have to run the script in ./script/make_utils/setup_os_deps.sh
. Specify the --linux-install-python
flag if you want to install python3.8 as well on apt-enabled Linux distributions. The script should install everything you need for Docker and bare OS development (you can first check the content of the file to check what it will do).
It is strongly recommended to use the development Docker (see the docker guide). However, our helper script should bring all the tools you need to develop directly on Linux and macOS.
For Windows see the Warning admonition below.
The project targets Python 3.8 through 3.9 inclusive.
For Windows users, the setup_os_deps.sh script does not install dependencies, because of the many different installation methods available and the lack of a single package manager.
The first step is to install Python (as some of our dev tools depend on it), then Poetry. In addition to installing Python, you are still going to need the following software available on the PATH on Windows, as some of our basic dev tools depend on them:
Development on Windows only works with the Docker environment. Follow this link to setup the Docker environment.
Concrete ML is a Python library, so Python should be installed to develop Concrete ML. `v3.8` and `v3.9` are the only supported versions.
As stated at the start of this document, you can install Python 3.8 for Linux automatically if it's available in your distribution's apt repository using the ./script/make_utils/setup_os_deps.sh script.
You can follow this guide to install it (alternatively, you can google how to install Python 3.8 or 3.9).
Poetry is our package manager. It drastically simplifies dependency and environment management.
As stated at the start of this document, you can install Poetry for macOS and Linux automatically using the ./script/make_utils/setup_os_deps.sh script.
You can follow this official guide to install it.
As there is no concrete-compiler package for Windows, only the dev dependencies can be installed. This requires Poetry >= 1.2. At the time of writing (March 2022), only an alpha version of Poetry 1.2 is available; use the official installer to install preview versions.
The dev tools use make
to launch the various commands.
As stated at the start of this document, you can install make
for macOS and Linux automatically if it's available in your distribution's apt repository using the ./script/make_utils/setup_os_deps.sh script.
On Linux, you can install make
from your distribution's preferred package manager.
On macOS, you can install a more recent version of make
via brew:
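For example, with Homebrew installed:

```bash
brew install make
```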
It is possible to install gmake
as make
. Check this StackOverflow post for more info.
On Windows, check this GitHub gist.
In the following sections, be sure to use the proper make
tool for your system: make
, gmake
, or other.
Now, it's time to get the source code of Concrete ML.
Clone the code repository using the link for your favourite communication protocol (ssh or https).
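For example (assuming the public repository location):

```bash
# over https
git clone https://github.com/zama-ai/concrete-ml.git
# or over ssh
git clone git@github.com:zama-ai/concrete-ml.git
```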
We are going to make use of virtual environments. This helps to keep the project isolated from other Python
projects in the system. The following commands will create a new virtual environment under the project directory and install dependencies to it.
The following command will not work on Windows if you don't have Poetry >= 1.2.
Finally, all we need to do is to activate the newly created environment using the following command:
Docker automatically creates and sources a venv in ~/dev_venv/. The venv persists thanks to volumes. We also create a volume for ~/.cache to speed up later reinstallations. You can check which Docker volumes exist with:
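The standard Docker command for this is:

```bash
docker volume ls
```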
You can still run all make
commands inside Docker (to update the venv, for example). Be mindful of the current venv being used (the name in parentheses at the beginning of your command prompt).
After your work is done, you can simply run the following command to leave the environment:
From time to time, new dependencies will be added to the project or the old ones will be removed. The command below will make sure the project has the proper environment. So run it regularly!
If you are having issues, consider using the dev Docker exclusively (unless you are working on OS specific bug fixes or features).
Here are the steps you can take on your OS to try and fix issues:
At this point, you should consider using Docker as nobody will have the exact same setup as you. If, however, you need to develop on your OS directly, you can ask us for help but may not get a solution right away.
Here are the steps you can take in your Docker to try and fix issues:
If the problem persists at this point, you should ask for help. We're here and ready to assist!
Concrete-ML allows you to compile a torch model to its FHE counterpart.
This process executes most of the concepts described in the documentation on how to use quantization and triggers the compilation to be able to run the model over homomorphically encrypted data.
Note that the architecture of the neural network passed to be compiled must respect some hard constraints given by FHE. Please read our detailed documentation on these limitations.
Once your model is trained, you can simply call the compile_torch_model
function to execute the compilation.
You can then call quantized_numpy_module.forward_fhe.encrypt_run_decrypt()
to have the FHE inference.
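A minimal sketch of this flow is shown below; the concrete.ml.torch.compile import path and the n_bits argument are assumptions about the Concrete-ML API, so check the API reference for the exact signature.

```python
import numpy
import torch

# Assumed import path for the compilation helper
from concrete.ml.torch.compile import compile_torch_model

torch_model = torch.nn.Sequential(
    torch.nn.Linear(2, 4),
    torch.nn.ReLU(),
    torch.nn.Linear(4, 2),
)

# A representative inputset, used to calibrate quantization and set bit widths
inputset = numpy.random.uniform(-1, 1, size=(100, 2))

quantized_numpy_module = compile_torch_model(torch_model, inputset, n_bits=3)

# Quantize a single sample (the FHE engine does not support batching),
# then run the encrypted inference on the quantized integers
q_x = quantized_numpy_module.quantize_input(inputset[:1])
fhe_prediction = quantized_numpy_module.forward_fhe.encrypt_run_decrypt(q_x)
```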
Now your model is ready to infer in FHE settings.
fhe_prediction
contains the clear quantized output. The user can now dequantize the output to get the actual floating point prediction as follows:
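Continuing the sketch above, and using the dequantize_output method described in the quantization documentation:

```python
# Convert the clear quantized output back to floating point predictions
predictions = quantized_numpy_module.dequantize_output(fhe_prediction)
```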
If you want to see more compilation examples, you can check out the Fully Connected Neural Network example.
Our torch conversion pipeline uses ONNX and an intermediate representation. We refer the user to the Concrete ML ONNX operator reference for more information.
The following operators in torch will be exported as Concrete-ML compatible ONNX operators:
Operators that take an encrypted input and unencrypted constants:
Note that the equivalent versions from torch.functional
are also supported.
| Model | Dataset | Metric | Clear | Quantized | FHE |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | Synthetic 1D | R2 | 0.876 | 0.863 | 0.863 |
| Logistic Regression | Synthetic 2D with 2 classes | accuracy | 0.90 | 0.875 | 0.875 |
| Poisson Regression | | mean Poisson deviance | 1.38 | 1.68 | 1.68 |
| Decision Tree | | precision score | 0.95 | 0.97 | 0.97* |
| XGBoost | | MCC | 0.48 | 0.52 | 0.52* |
| Fully Connected NN | | accuracy | 0.947 | 0.895 | 0.895 |
| Convolutional NN | | accuracy | 0.90 | ** | ** |
Artificial Neuron (from: ) |
Fully Connected Neural Network |
Pruned Fully Connected Neural Network |
Our primary concern in this release was the ease of adoption of our framework. That is why we built APIs, which should feel natural to data scientists. While performance is also an important concern for deployment of FHE machine learning models, improvements on this front will come in future releases.
To this end, we have decided to mimic the APIs of scikit-learn and XGBoost for machine learning models (linear models and tree-based models) and of torch for deep learning models. We refer readers to the scikit-learn and torch usage guides, which show how similar our APIs are to their non-FHE counterparts.
From Wikipedia:
Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).
Modern computing has been using data types that are 32 or 64 bits wide for many years, for both integers and floating point values. Even bigger data types are available or can be constructed easily. However, due to the costly nature of FHE computations, using such types with FHE is impractical (or plain impossible) if we are to execute computations in a reasonable amount of time.
The basic idea of quantization is to take a range of values that are represented by a large data type and represent them using a single value of a smaller data type. This means that some accuracy in the representation is lost (e.g. a simple approach is to eliminate least-significant bits), but, in many cases in machine learning, it is possible to adapt the models to give meaningful results while using these smaller data types. This significantly reduces the number of bits necessary for intermediary results during the execution of these machine learning models.
Let's first define some notations. Let $[\alpha, \beta]$ be the range of our value to quantize, where $\alpha$ is the minimum and $\beta$ is the maximum.
To quantize a range of floating point values (in $\mathbb{R}$) to integer values (in $\mathbb{Z}$), we first need to choose the data type that is going to be used. Concrete-Library, the backend library used by Concrete-ML, is currently limited to 7-bit integers, so we'll use this value for the example. Knowing the number of bits that can be used, for a value in the range $[\alpha, \beta]$, we can compute the scale $S$ of the quantization:

$$S = \frac{\beta - \alpha}{2^n - 1}$$

where $n$ is the number of bits (here, 7).

In practice, the quantization scale is then $S = \frac{\beta - \alpha}{127}$. This means the gap between consecutive representable values cannot be smaller than $S$, which, in turn, means there can be a substantial loss of precision. Every interval of length $S$ will be represented by a value within the range $[0, 2^n - 1]$.
The other important parameter from this quantization schema is the zero point value. This essentially brings the 0 floating point value to a specific integer. If the quantization scheme is asymmetric (quantized values are not centered on 0), the resulting integer will be in $[0, 2^n - 1]$.
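The scale and zero point computations can be sketched in plain NumPy (an illustrative example with assumed values, not the Concrete-ML implementation):

```python
import numpy

# Illustrative range: quantize values in [-1, 1] to n = 7 bits
alpha, beta, n = -1.0, 1.0, 7

S = (beta - alpha) / (2**n - 1)       # quantization scale
Z = int(round(-alpha / S))            # zero point of the asymmetric scheme

x = numpy.array([-1.0, -0.3, 0.0, 0.7, 1.0])
q = numpy.clip(numpy.round(x / S) + Z, 0, 2**n - 1).astype(numpy.int64)

x_dq = (q - Z) * S                    # dequantized (approximate) values
print(q)
print(x_dq)
```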
Regarding quantization in Concrete-ML and FHE compilation, it is important to understand the difference between two approaches:
The quantization is done automatically during the model compilation stage (inside our framework). This approach requires little work by the user, but may not be a one-size-fits-all solution for all types of models that a user may want to implement.
The quantization is done by the user, before compilation to FHE; notably, the quantization is completely controlled by the user, and can be done by any means, including by using third-party frameworks. In this approach, the user is responsible for implementing their models directly with NumPy.
For the moment, the first method is applicable through the tools provided in Concrete-ML, and the models implemented in our framework make use of this approach. When quantization is only performed in the compilation stage, the model training stage does not take into account that the model will be quantized. This setting is called Post-Training Quantization (PTQ), and this is the approach currently taken in Concrete-ML. PTQ is effective for moderate bit widths, such as 7-8 bits per weight and activation, but, for a model to be compatible with FHE constraints, we must quantize these values to as few as 2-3 bits. Thus, for models with more than a few neurons per layer, PTQ is not the optimal solution, and we plan to implement a more performant approach called Quantization Aware Training in the near future.
Internally, Concrete-ML uses ONNX operators as an intermediate representation (or IR) for manipulating machine learning models produced through export from frameworks such as PyTorch, Hummingbird and skorch. As ONNX is becoming the standard exchange format for neural networks, this allows Concrete-ML to be flexible while also making model representation manipulation quite easy. In addition, it allows for straightforward mapping to NumPy operators, supported by Concrete-Numpy to use the Concrete stack's FHE conversion capabilities.
Here we list the operators that are supported as well as the operators that have a quantized version, which should allow you to perform automatic Post Training Quantization (PTQ) of your models.
Please note that due to the current precision constraints from the Concrete stack, PTQ may produce circuits that have worse accuracy than your original model.
The following operators should be supported for evaluation and conversion to an equivalent NumPy circuit. As long as your model converts to an ONNX using these operators, it should be convertible to an FHE equivalent.
Do note that all operators may not be fully supported for conversion to a circuit executable in FHE. You will get error messages should you use such an operator in a circuit you are trying to convert to FHE.
Abs
Acos
Acosh
Add
Asin
Asinh
Atan
Atanh
Celu
Clip
Constant
Conv
Cos
Cosh
Div
Elu
Equal
Erf
Exp
Gemm
Greater
HardSigmoid
Identity
LeakyRelu
Less
Log
MatMul
Mul
Not
Relu
Reshape
Selu
Sigmoid
Sin
Sinh
Softplus
Sub
Tan
Tanh
ThresholdedRelu
Abs: QuantizedAbs
Add: QuantizedAdd
Celu: QuantizedCelu
Clip: QuantizedClip
Conv: QuantizedConv
Elu: QuantizedElu
Exp: QuantizedExp
Gemm: QuantizedGemm
HardSigmoid: QuantizedHardSigmoid
Identity: QuantizedIdentity
LeakyRelu: QuantizedLeakyRelu
Linear: QuantizedLinear
Log: QuantizedLog
MatMul: QuantizedMatMul
Relu: QuantizedRelu
Reshape: QuantizedReshape
Selu: QuantizedSelu
Sigmoid: QuantizedSigmoid
Softplus: QuantizedSoftplus
Tanh: QuantizedTanh
When using quantized values in a matrix multiplication or convolution, the equations for computing the result are more involved. The IntelLabs distiller quantization documentation provides a more detailed explanation of the maths used to quantize values and how to keep computations consistent.
We detail the use of quantization within Concrete-ML in the dedicated quantization section.
IntelLabs distiller explanation of quantization:
Lei Mao's blog on quantization:
Google paper on neural network quantization and integer-only inference:
Concrete-ML is built on top of Zama’s Concrete stack. It uses Concrete-Numpy, which itself uses the Concrete-Compiler.
The Concrete-Compiler takes MLIR code as input representing a computation circuit and compiles it to an executable using Concrete primitives to perform the computations.
We refer the reader to Concrete-Numpy documentation and, more generally, to the documentation of the whole Concrete-Framework for more information.
Hummingbird contains an interesting feature for Concrete-ML: it converts many algorithms (see supported algorithms) to tensor computations using a specific backend (torch, torchscript, ONNX and TVM).
Concrete-ML allows the conversion of an ONNX inference to NumPy inference (note that NumPy is always our entry point to run models in FHE).
We use a simple functionality of Hummingbird: the `convert` function, which can be imported from the `hummingbird.ml` package.
This function can be used to convert a machine learning model to an ONNX as follows:
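A sketch of this conversion is shown below. The scikit-learn model and data are placeholders; convert(model, backend="onnx", test_input=...) and the container's model attribute follow Hummingbird's documented API, but double-check them for your Hummingbird version.

```python
from hummingbird.ml import convert
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
sklearn_model = LogisticRegression().fit(X, y)

# Convert the trained scikit-learn model to ONNX through Hummingbird
onnx_container = convert(sklearn_model, backend="onnx", test_input=X)
onnx_model = onnx_container.model
```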
In theory, we can directly use this onnx_model
within our get_equivalent_numpy_forward
(as long as all operators present in the ONNX model are implemented in NumPy) and get the NumPy inference.
In practice, we apply some extra steps to clean the ONNX model and make the graph compatible with our framework, such as:
applying quantization where needed
deleting non-FHE friendly ONNX operators, such as Softmax and ArgMax
We use skorch to implement multi-layer, fully-connected torch neural networks in Concrete-ML in a way that is compatible with the scikit-learn API.
This wrapper implements torch training boilerplate code, alleviating the work that needs to be done by the user. It is possible to add hooks during the training phase, for example once an epoch is finished.
Skorch allows the user to easily create a classifier or regressor around a neural network (NN), implemented in Torch as a nn.Module
. We provide a simple, fully-connected, multi-layer NN with a configurable number of layers and optional pruning (see pruning).
The SparseQuantNeuralNetImpl
class implements this neural network. Please see the documentation on this class in the API guide.
The constructor of this class takes some parameters that influence FHE compatibility:
`n_w_bits` (default 3): number of bits for weights
`n_a_bits` (default 3): number of bits for activations and inputs
`n_accum_bits` (default 7): maximum accumulator bit width to impose through pruning
`n_hidden_neurons_multiplier` (default 4): explained below
A linear or convolutional layer of an NN will compute a linear combination of weights and inputs (we also call this a 'multi-sum'). For example, a linear layer will compute:

$$\mathrm{output}^k = \sum_{i=1}^{N} w_i^k x_i$$

where $w^k$ is the $k$-th neuron in the layer. In this case, the sum is taken over a single dimension of size $N$. A convolutional layer will compute:

$$\mathrm{output}^k = \sum_{c=1}^{C} \sum_{h=1}^{K_h} \sum_{w=1}^{K_w} w_{c,h,w}^k \, x_{c,h,w}$$

where $w^k$ is the $k$-th filter of the convolutional layer and $C$, $K_h$, $K_w$ are the number of input channels, the kernel height and the kernel width, respectively.
Following the formulas for the resulting bit width of quantized linear combinations described here, notably the maximum dimensionality $\Omega$ of the input and weights that can make the result exceed 7 bits:

$$\Omega = \left\lfloor \frac{2^{n_{\max}} - 1}{(2^{n_{\text{bits}}^{\text{inputs}}} - 1)(2^{n_{\text{bits}}^{\text{weights}}} - 1)} \right\rfloor$$

where $n_{\max} = 7$ is the maximum precision allowed.

For example, with the default parameters $n_{\text{bits}}^{\text{weights}} = 3$ and $n_{\text{bits}}^{\text{inputs}} = 3$ and with $n_{\max} = 7$, the worst case is a scenario where all inputs and weights are equal to their maximal value $2^3 - 1 = 7$. The formula above tells us that, in this case, we can afford at most $\Omega = \lfloor 127 / (7 \times 7) \rfloor = 2$ elements in the multi-sums detailed above.
In a practical setting, the distribution of the weights of a neural network is Gaussian. Thus, there will be weights that are equal to 0 and many weights will have small values. In a typical scenario, we can exceed the worst-case number of active neurons. The parameter `n_hidden_neurons_multiplier` is a factor that is multiplied with $\Omega$ to determine the total number of non-zero weights that should be kept in a neuron.
The pruning mechanism is already implemented in SparseQuantNeuralNetImpl
, and the user only needs to determine the parameters listed above. They can choose them in a way that is convenient, e.g. maximizing accuracy.
The skorch wrapper requires that all the parameters that will be passed to the wrapped nn.Module
be prefixed with module__
. For example, the code to create an FHE-compatible Concrete-ML fully-connected NN classifier for a dataset with 10 input dimensions and two classes, will thus be:
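A sketch of such a configuration is shown below. Apart from the parameters documented above, the import path, the wrapper class name and the input/output dimension parameter names are assumptions; refer to the API guide and the classifier comparison notebook for the exact usage.

```python
import torch.nn as nn

# Assumed import path: Concrete-ML exposes a skorch-based classifier wrapper around
# SparseQuantNeuralNetImpl; the exact module path and class name may differ.
from concrete.ml.sklearn.qnn import NeuralNetClassifier

classifier = NeuralNetClassifier(
    # Parameters forwarded to the wrapped nn.Module are prefixed with module__
    module__input_dim=10,                    # assumed parameter name
    module__n_outputs=2,                     # assumed parameter name
    module__n_layers=2,
    module__n_w_bits=2,
    module__n_a_bits=2,
    module__n_accum_bits=7,
    module__n_hidden_neurons_multiplier=4,
    module__activation_function=nn.ReLU,
    max_epochs=10,
)
```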
We could then increase n_hidden_neurons_multiplier
to improve performance, taking care to verify that the compiled NN does not exceed 7 bits of accumulator bit width.
A similar example is given in the classifier comparison notebook.
In this section, we detail the usage of quantization in Concrete-ML.
Since quantization is necessary to make ML models work in FHE, Concrete-ML implements quantized ML models to facilitate usage, but also exposes some quantization tools. The core of this functionality is the conversion of floating point values to integers, following the techniques described here. We can apply this conversion using QuantizedArray
, available in concrete.ml.quantization
.
The QuantizedArray
class takes several arguments that determine how float values are quantized:
`n_bits`: defines the precision of the quantization
`values`: the floating point values that will be converted to integers
`is_signed`: determines if the quantized integer values should allow negative values
`is_symmetric`: determines if the range of floating point values to be quantized should be taken as symmetric around zero
Please see the API reference for more information.
We can also use symmetric quantization, where the integer values are centered around 0 and may, thus, take negative values.
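For example (the positional order of n_bits and values in the constructor is an assumption; see the API reference for the exact signature):

```python
import numpy
from concrete.ml.quantization import QuantizedArray

values = numpy.array([-1.0, -0.5, 0.0, 0.7, 1.0])

# 7-bit, unsigned, asymmetric quantization of the float values
q_arr = QuantizedArray(7, values)
print(q_arr.qvalues)     # integer representation
print(q_arr.dequant())   # approximate floats recovered from the integers

# Signed, symmetric quantization: integer values are centered around 0
q_sym = QuantizedArray(7, values, is_signed=True, is_symmetric=True)
print(q_sym.qvalues)
```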
Machine learning models are implemented with a diverse set of operations, such as convolution, linear transformations, activation functions and element-wise operations. When working with quantized values, these operations cannot be carried out in the same way as for floating point values. With quantization, it is necessary to re-scale the input and output values of each operation to fit in the quantization domain.
The ML models implemented in Concrete-ML provide features to let the user quantize the input data and dequantize the output data.
Here is a simple example showing how to perform inference, starting from float values and ending up with float values. Note that the FHE engine that is compiled for the ML models does not support data batching.
If we are to examine the operations done by quantize_input
and dequantize_output
, we will see usage of the QuantizedArray
described above. When the ML model quantized_module
is calibrated, the min and max of the value distributions will be recorded, and these are then applied to quantize/dequantize new data.
Here, a different usage of QuantizedArray
is shown, where it is constructed from quantized integer values and the scale
and zero-point
are set explicitly from calibration parameters. Once the QuantizedArray
is constructed, calling dequant()
will compute the floating point values corresponding to the integer values qvalues
, which are the output of the forward_fhe.encrypt_run_decrypt(..)
call.
Intermediary values computed during model inference might need to be re-scaled into the quantized domain of a subsequent model operator. For example, the output of a convolution layer in a neural network might have values that are 7 bits wide, but the next convolutional layer requires that its inputs are, at most, 2 bits wide. In the non-encrypted realm, this implies that we need to make use of floating point operations. In the FHE setting, where we only work with integers, this could be a problem, but, luckily, the FHE implementation behind Concrete-ML provides a solution. We essentially make use of a table lookup, which is later translated into a Programmable Bootstrap (PBS).
Of course, having a PBS for every quantized addition isn't recommended for computational cost reasons. Also, a PBS is currently only allowed for univariate operations (i.e. matrix multiplication can't be done in a PBS). Therefore, our quantized modules split the computation of floating point values and unsigned integers, as it is currently done in concrete.ml.quantization.QuantizedLinear
. Moreover, the operations done by the activation function of a previous layer and additional re-scaling to the new quantized domain, which are all floating point operations, can be fused to a single TLU. Concrete-ML implements quantized operators that perform this fusion, significantly reducing the number of TLUs necessary to perform inference.
We can distinguish three types of operators:
Operators that perform linear combinations of encrypted and constant (clear) values. For example: matrix multiplication, convolution, addition
Operators that perform element-wise operations between two encrypted tensors. For example: addition
Element-wise, fixed-function operators which can be: addition with a constant, activation functions
In the first category, we will find operators such as `Gemm`, which will quantize their inputs. Notice that here we use the `_prepare_inputs_with_constants` helper function, with `quantize_actual_values=True`, to apply the quantization function to the input data. The quantization function uses floating point and a non-linear function, `round`, and will thus produce a TLU, together with any preceding floating point operations.
For element-wise operations with a fixed function, we simply let Concrete-Numpy generate a TLU. To do so, we just need to give this function the corresponding NumPy implementation, which must be defined in ops_impl.py.
It was decided to use ONNX as the intermediate format to convert various ML models (including torch nn.Module and various sklearn models, among others) to NumPy. The reason here is that converting/interpreting torchscript and other representations would require a lot of effort while ONNX has tools readily available to easily manipulate the model's representation in Python. Additionally, JAX had an example of a lightweight interpreter to run ONNX models as NumPy code.
In the diagram above, it is perfectly possible to stop at the NumpyModule
level if you just want to run the torch model as NumPy code without doing quantization.
Note that if you keep the obtained NumpyModule
without quantizing it with Post Training Quantization (PTQ), it is very likely that it won't be convertible to FHE since the Concrete stack requires operators to use integers for computations.
The NumpyModule
stores the ONNX model that it interprets. The interpreter works by going through the ONNX graph (which, by specification, is sorted in topological order, allowing users to run through the graph without having to care for evaluation order) and storing the intermediate results as it goes. To execute a node, the interpreter feeds the required inputs - taken either from the model inputs or the intermediate results - to the NumPy implementation of each ONNX node.
Do note that the NumpyModule
interpreter currently supports the following ONNX operators.
Initializers (ONNX's parameters) are quantized according to n_bits
and passed to the Post Training Quantization (PTQ) process.
During the PTQ process, the ONNX model stored in the NumpyModule
is interpreted and calibrated using the supported ONNX operators for PTQ.
Quantized operators are then used to create a QuantizedModule
that, similarly to the NumpyModule
, runs through the operators to perform the quantized inference with integers-only operations.
That QuantizedModule
is then compilable to FHE if the intermediate values conform to the 7 bits precision limit of the Concrete stack.
QuantizedOp
`QuantizedOp` is the base class for all ONNX quantized operators. It abstracts away a lot of things to allow easy implementation of new quantized ops.
You can check ops_impl.py
to see how implementations are done in NumPy. The requirements are as follows:
The required inputs should be positional arguments only, placed before the `/`, which marks the limit of the positional arguments
The optional inputs should be positional or keyword arguments, placed between the `/` and the `*`, which marks the limit of positional or keyword arguments
The operator attributes should be keyword arguments only, placed after the `*` (see the sketch below)
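For illustration, a hypothetical operator implementation following this convention could look like:

```python
import numpy

# Hypothetical operator: the required input `x` is positional-only, the optional
# input `b` is positional-or-keyword, and the ONNX attribute `alpha` is keyword-only.
def numpy_toy_op(x, /, b=None, *, alpha=1.0):
    result = alpha * numpy.asarray(x)
    if b is not None:
        result = result + b
    return (result,)  # a tuple of outputs, since ONNX nodes can have several outputs
```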
The proper use of positional/keyword arguments is required to allow the QuantizedOp
class to properly populate metadata automatically. It uses Python inspect modules and stores relevant information for each argument related to its positional/keyword status. This allows us to use our NumPy implementation as specifications for QuantizedOp
, which removes some data duplication and allows us to have a single source of truth for QuantizedOp
and ONNX NumPy implementations.
In that case (unless the quantized implementation requires special handling like QuantizedGemm
), you can just set _impl_for_op_named
to the name of the ONNX op for which the quantized class is implemented (this uses the mapping ONNX_OPS_TO_numpy_IMPL
we have in onnx_utils.py
to get the right implementation).
If you want to provide an alternative implementation, you can set _impl_for_op_named
to the name of the operator (e.g. Exp
) and you can set impl
and/or q_impl
to the functions that will do the alternative handling. QuantizedGemm
is an example of such a case where quantized matrix multiplication requires proper handling of scales and zero points. The q_impl
of that class reflects that.