TFHE-rs now includes a GPU backend: a CUDA implementation for performing integer arithmetic on encrypted data. This short tutorial shows how to update an existing program to use GPU acceleration, or how to start a new one that runs on the GPU.
To use the TFHE-rs GPU backend in your project, you first need to add it as a dependency in your `Cargo.toml`.
If you are using an x86 machine:
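A dependency entry along the following lines is a reasonable starting point. The version number and the `boolean`, `shortint`, and `integer` features are assumptions; adjust them to the release you are targeting. `x86_64-unix` is the feature listed for Linux on x86 in the platform table, and `gpu` is assumed to be the feature that enables the CUDA backend.

```toml
# Cargo.toml (sketch; version is an assumption, pin the release you actually use)
[dependencies]
tfhe = { version = "0.5.0", features = ["boolean", "shortint", "integer", "x86_64-unix", "gpu"] }
```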
If you are using an ARM machine:
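On ARM the entry would differ only in the target feature. As before, the version number and the `boolean`, `shortint`, and `integer` features are assumptions; `aarch64-unix` is the feature listed for Linux on aarch64 in the platform table, and `gpu` is assumed to enable the CUDA backend.

```toml
# Cargo.toml (sketch; version is an assumption, pin the release you actually use)
[dependencies]
tfhe = { version = "0.5.0", features = ["boolean", "shortint", "integer", "aarch64-unix", "gpu"] }
```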
When running code that uses TFHE-rs, it is highly recommended to run in release mode with cargo's `--release` flag to have the best possible performance.
The TFHE-rs GPU backend is supported on Linux (x86, aarch64).
Compared with the CPU example, the only difference lies in the key creation, which is detailed below.
Here is a full example (combining the client and server parts):
The key configuration differs from the CPU one. Both the client key and the server key are still generated by the client (which is assumed to run on a CPU), but the server must then decompress the server key to convert it into the right format. To do so, the server calls `decompress_to_gpu()`. From then on, there is no difference between the CPU and the GPU.
On the client side, data is encrypted exactly as on the CPU, for example with `FheUint8::encrypt(27u8, &client_key)`.
The server must first set up its keys, as on the CPU, with `set_server_key(gpu_key);`. Homomorphic computations are then written with the same code as on the CPU.
Finally, the client decrypts the results with its client key, for example `let decrypted_result: u8 = result.decrypt(&client_key);`.
TFHE-rs can take advantage of the large number of threads available on a GPU. To do so, update the configuration to use the multi-bit parameter set: `let config = ConfigBuilder::with_custom_parameters(PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();`
The complete example becomes:
The GPU backend includes the operations listed in the operations table at the end of this page (name, symbol, `Enc`/`Enc` and `Enc`/`Int` variants).
All operations follow the same syntax as on the CPU.
The table below reports benchmarks for homomorphic operations running on a single V100 GPU from AWS (p3.2xlarge machines), with the default parameters:
| Operation \ Size | FheUint8 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
|---|---|---|---|---|---|---|
| `cuda_add` | 103.33 ms | 129.26 ms | 156.83 ms | 186.99 ms | 320.96 ms | 528.15 ms |
| `cuda_bitand` | 26.11 ms | 26.21 ms | 26.63 ms | 27.24 ms | 43.07 ms | 65.01 ms |
| `cuda_bitor` | 26.1 ms | 26.21 ms | 26.57 ms | 27.23 ms | 43.05 ms | 65.0 ms |
| `cuda_bitxor` | 26.08 ms | 26.21 ms | 26.57 ms | 27.25 ms | 43.06 ms | 65.07 ms |
| `cuda_eq` | 52.82 ms | 53.0 ms | 79.4 ms | 79.58 ms | 96.37 ms | 145.25 ms |
| `cuda_ge` | 104.7 ms | 130.23 ms | 156.19 ms | 183.2 ms | 213.43 ms | 288.76 ms |
| `cuda_gt` | 104.93 ms | 130.2 ms | 156.33 ms | 183.38 ms | 213.47 ms | 288.8 ms |
| `cuda_le` | 105.14 ms | 130.47 ms | 156.48 ms | 183.44 ms | 213.33 ms | 288.75 ms |
| `cuda_lt` | 104.73 ms | 130.23 ms | 156.2 ms | 183.14 ms | 213.33 ms | 288.74 ms |
| `cuda_max` | 156.7 ms | 182.65 ms | 210.74 ms | 251.78 ms | 316.9 ms | 442.71 ms |
| `cuda_min` | 156.85 ms | 182.67 ms | 210.39 ms | 252.02 ms | 316.96 ms | 442.95 ms |
| `cuda_mul` | 219.73 ms | 302.11 ms | 465.91 ms | 955.66 ms | 2.71 s | 9.15 s |
| `cuda_ne` | 52.72 ms | 52.91 ms | 79.28 ms | 79.59 ms | 96.37 ms | 145.36 ms |
| `cuda_neg` | 103.26 ms | 129.4 ms | 157.19 ms | 187.09 ms | 321.27 ms | 530.11 ms |
| `cuda_sub` | 103.34 ms | 129.42 ms | 156.87 ms | 187.01 ms | 321.04 ms | 528.13 ms |

Platform support:

| OS | x86 | aarch64 |
|---|---|---|
| Linux | `x86_64-unix` | `aarch64-unix`* |
| macOS | Unsupported | Unsupported* |
| Windows | Unsupported | Unsupported |
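
Operations included in the GPU backend:

| name | symbol | `Enc`/`Enc` | `Enc`/`Int` |
|---|---|---|---|
| Neg | `-` |  | N/A |
| Add | `+` |  |  |
| Sub | `-` |  |  |
| Mul | `*` |  |  |
| Div | `/` |  |  |
| Rem | `%` |  |  |
| Not | `!` |  | N/A |
| BitAnd | `&` |  |  |
| BitOr | `\|` |  |  |
| BitXor | `^` |  |  |
| Shr | `>>` |  |  |
| Shl | `<<` |  |  |
| Rotate right | `rotate_right` |  |  |
| Rotate left | `rotate_left` |  |  |
| Min | `min` |  |  |
| Max | `max` |  |  |
| Greater than | `gt` |  |  |
| Greater than or equal | `ge` |  |  |
| Lower than | `lt` |  |  |
| Lower than or equal | `le` |  |  |
| Equal | `eq` |  |  |
| Cast (into dest type) | `cast_into` |  | N/A |
| Cast (from src type) | `cast_from` |  | N/A |
| Ternary operator | `if_then_else` |  |  |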