# Run on GPU
TFHE-rs now includes a GPU backend, featuring a CUDA implementation for performing integer arithmetic on encrypted data. This tutorial shows how to update an existing program to use GPU acceleration, or how to start a new one on GPU.
## Prerequisites

### Importing to your project

To use the TFHE-rs GPU backend in your project, you first need to add it as a dependency in your `Cargo.toml`.
If you are using an x86 machine:
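The dependency declaration might look like the following sketch (the version placeholder and the exact feature names are assumptions; check the documentation of your TFHE-rs release for the correct flags):

```toml
tfhe = { version = "*", features = ["boolean", "shortint", "integer", "gpu", "x86_64-unix"] }
```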
If you are using an ARM machine:
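On aarch64, the declaration might look like this sketch (again, the version placeholder and feature names are assumptions to verify against your TFHE-rs release):

```toml
tfhe = { version = "*", features = ["boolean", "shortint", "integer", "gpu", "aarch64-unix"] }
```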
When running code that uses TFHE-rs, it is highly recommended to run in release mode with cargo's `--release` flag (e.g., `cargo run --release`) to get the best possible performance.
### Supported platforms

The TFHE-rs GPU backend is supported on Linux (x86, aarch64).

| OS | x86 | aarch64 |
| --- | --- | --- |
| Linux | Supported | Supported |
| macOS | Unsupported | Unsupported* |
| Windows | Unsupported | Unsupported |
## A first example

### Configuring and creating keys

In comparison with the CPU example, the only difference lies in the key creation, which is detailed here.

Here is a full example (combining the client and server parts):
### Setting the keys

The key configuration differs from the CPU one. More precisely, while both client and server keys are still generated by the client (which is assumed to run on a CPU), the server key must then be decompressed by the server to convert it into the GPU format. To do so, the server calls `decompress_to_gpu()`. From then on, there is no difference between the CPU and the GPU.
### Encrypting data

On the client side, the method used to encrypt the data is exactly the same as the CPU one, i.e.:
### Computation

The server must first set up its keys, as on the CPU, with `set_server_key(gpu_key);`. Then, homomorphic computations are performed with the same code as the one described here.
### Decryption

Finally, the client gets the decrypted result by computing:
### Improving performance

TFHE-rs can leverage the large number of threads a GPU provides. To do so, the configuration should be updated to use multi-bit parameters:

```rust
let config = ConfigBuilder::with_custom_parameters(
    PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS,
    None,
)
.build();
```

The complete example becomes:
## List of available operations

The GPU backend includes the following operations:

| name | symbol |
| --- | --- |
| Neg | `-` |
| Add | `+` |
| Sub | `-` |
| Mul | `*` |
| Div | `/` |
| Rem | `%` |
| Not | `!` |
| BitAnd | `&` |
| BitOr | `\|` |
| BitXor | `^` |
| Shr | `>>` |
| Shl | `<<` |
| Rotate right | `rotate_right` |
| Rotate left | `rotate_left` |
| Min | `min` |
| Max | `max` |
| Greater than | `gt` |
| Greater or equal | `ge` |
| Lower than | `lt` |
| Lower or equal | `le` |
| Equal | `eq` |
| Cast (into dest type) | `cast_into` |
| Cast (from src type) | `cast_from` |
| Ternary operator | `select` |
All operations follow the same syntax as the one described here.
## Benchmarks

The table below contains benchmarks for homomorphic operations running on a single NVIDIA V100 GPU on AWS (p3.2xlarge instance), with the default parameters:
Operation \ Size | FheUint8 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
---|---|---|---|---|---|---|
cuda_add | 103.33 ms | 129.26 ms | 156.83 ms | 186.99 ms | 320.96 ms | 528.15 ms |
cuda_bitand | 26.11 ms | 26.21 ms | 26.63 ms | 27.24 ms | 43.07 ms | 65.01 ms |
cuda_bitor | 26.10 ms | 26.21 ms | 26.57 ms | 27.23 ms | 43.05 ms | 65.00 ms |
cuda_bitxor | 26.08 ms | 26.21 ms | 26.57 ms | 27.25 ms | 43.06 ms | 65.07 ms |
cuda_eq | 52.82 ms | 53.00 ms | 79.40 ms | 79.58 ms | 96.37 ms | 145.25 ms |
cuda_ge | 104.70 ms | 130.23 ms | 156.19 ms | 183.20 ms | 213.43 ms | 288.76 ms |
cuda_gt | 104.93 ms | 130.20 ms | 156.33 ms | 183.38 ms | 213.47 ms | 288.80 ms |
cuda_le | 105.14 ms | 130.47 ms | 156.48 ms | 183.44 ms | 213.33 ms | 288.75 ms |
cuda_lt | 104.73 ms | 130.23 ms | 156.20 ms | 183.14 ms | 213.33 ms | 288.74 ms |
cuda_max | 156.70 ms | 182.65 ms | 210.74 ms | 251.78 ms | 316.90 ms | 442.71 ms |
cuda_min | 156.85 ms | 182.67 ms | 210.39 ms | 252.02 ms | 316.96 ms | 442.95 ms |
cuda_mul | 219.73 ms | 302.11 ms | 465.91 ms | 955.66 ms | 2.71 s | 9.15 s |
cuda_ne | 52.72 ms | 52.91 ms | 79.28 ms | 79.59 ms | 96.37 ms | 145.36 ms |
cuda_neg | 103.26 ms | 129.40 ms | 157.19 ms | 187.09 ms | 321.27 ms | 530.11 ms |
cuda_sub | 103.34 ms | 129.42 ms | 156.87 ms | 187.01 ms | 321.04 ms | 528.13 ms |