This guide explains how to update an existing program to leverage GPU acceleration, or how to start a new program that runs on the GPU.
TFHE-rs now supports a GPU backend implemented in CUDA, enabling integer arithmetic operations on encrypted data.
To use the TFHE-rs GPU backend in your project, add the `tfhe` dependency to your `Cargo.toml` with the `gpu` feature and the feature matching your architecture.
If you are using an x86 machine, enable the `x86_64-unix` feature.
If you are using an ARM machine, enable the `aarch64-unix` feature.
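As a minimal sketch of the dependency entry described above: the feature names `x86_64-unix`, `aarch64-unix` and `gpu` come from this guide, while the `boolean`, `shortint` and `integer` features and the `"*"` version are assumptions. Replace `"*"` with the TFHE-rs release you actually target.

```toml
[dependencies]
# x86 machine ("*" is a placeholder, pin the TFHE-rs release you target)
tfhe = { version = "*", features = ["boolean", "shortint", "integer", "x86_64-unix", "gpu"] }

# ARM machine: use this line instead of the one above
# tfhe = { version = "*", features = ["boolean", "shortint", "integer", "aarch64-unix", "gpu"] }
```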
For optimal performance when using TFHE-rs, run your code in release mode with the `--release` flag.
TFHE-rs GPU backend is supported on Linux (x86, aarch64).
Compared to the CPU example, the GPU setup differs only in key creation, as detailed below.
Here is a full example (combining the client and server parts):
Key configuration differs from the CPU case. Both the client key and the server key are still generated by the client (which is assumed to run on a CPU), but the server must then decompress the server key to convert it into the GPU format. To do so, the server calls `decompress_to_gpu()` on the compressed server key.
Once the key is decompressed, operations are written identically on CPU and GPU.
On the client side, data is encrypted exactly as with the CPU backend, for example `let a = FheUint8::encrypt(clear_a, &client_key);`.
The server first needs to set up its keys with `set_server_key(gpu_key)`.
Then, homomorphic computations are performed in the same way as on the CPU.
Finally, the client decrypts the result with its client key, for example `let decrypted_result: u8 = result.decrypt(&client_key);`.
TFHE-rs can leverage the large number of threads available on a GPU. To maximize the number of GPU threads in use, update your configuration accordingly:
Here's the complete example:
The GPU backend includes the following operations:
The equivalent signed operations are also available.
All operations follow the same syntax as the one described for the CPU backend.
All GPU benchmarks presented here were obtained on a single H100 GPU, and rely on the multithreaded PBS algorithm. The cryptographic parameters `PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS` were used.
The following table shows the performance when the inputs of the benchmarked operation are encrypted:
The following table shows the performance when the left input of the benchmarked operation is encrypted and the other is a clear scalar of the same size:
| OS | x86 | aarch64 |
|---|---|---|
| Linux | `x86_64-unix` | `aarch64-unix`* |
| macOS | Unsupported | Unsupported* |
| Windows | Unsupported | Unsupported |
Both inputs encrypted:

| Operation \ Size | FheUint7 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
|---|---|---|---|---|---|---|
| Negation (`-`) | 46 ms | 60 ms | 75 ms | 94 ms | 150 ms | 247 ms |
| Add / Sub (`+`, `-`) | 46 ms | 60 ms | 75 ms | 94 ms | 150 ms | 247 ms |
| Mul (`*`) | 83 ms | 121 ms | 195 ms | 456 ms | 1.35 s | 4.74 s |
| Equal / Not Equal (`eq`, `ne`) | 25 ms | 26 ms | 38 ms | 41 ms | 52 ms | 78 ms |
| Comparisons (`ge`, `gt`, `le`, `lt`) | 46 ms | 60 ms | 74 ms | 90 ms | 109 ms | 153 ms |
| Max / Min (`max`, `min`) | 71 ms | 86 ms | 101 ms | 124 ms | 165 ms | 236 ms |
| Bitwise operations (`&`, `\|`, `^`) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Shifts (`<<`, `>>`) | 71 ms | 88 ms | 109 ms | 180 ms | 279 ms | 494 ms |
| Left / Right Rotations (`rotate_left`, `rotate_right`) | 71 ms | 88 ms | 109 ms | 180 ms | 279 ms | 494 ms |
Encrypted left input, clear scalar right input:

| Operation \ Size | FheUint7 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
|---|---|---|---|---|---|---|
| Add / Sub (`+`, `-`) | 46 ms | 60 ms | 75 ms | 94 ms | 152 ms | 251 ms |
| Mul (`*`) | 67 ms | 101 ms | 149 ms | 282 ms | 727 ms | 2.11 s |
| Equal / Not Equal (`eq`, `ne`) | 26 ms | 27 ms | 27 ms | 41 ms | 45 ms | 57 ms |
| Comparisons (`ge`, `gt`, `le`, `lt`) | 29 ms | 41 ms | 54 ms | 69 ms | 87 ms | 117 ms |
| Max / Min (`max`, `min`) | 53 ms | 65 ms | 81 ms | 102 ms | 142 ms | 200 ms |
| Bitwise operations (`&`, `\|`, `^`) | 11 ms | 13 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Shifts (`<<`, `>>`) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Rotations (`rotate_left`, `rotate_right`) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
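| name | symbol | `Enc`/`Enc` | `Enc`/ `Int` |
|---|---|---|---|
| Neg | `-` | | N/A |
| Add | `+` | | |
| Sub | `-` | | |
| Mul | `*` | | |
| Div | `/` | | |
| Rem | `%` | | |
| Not | `!` | | N/A |
| BitAnd | `&` | | |
| BitOr | `\|` | | |
| BitXor | `^` | | |
| Shr | `>>` | | |
| Shl | `<<` | | |
| Rotate right | `rotate_right` | | |
| Rotate left | `rotate_left` | | |
| Min | `min` | | |
| Max | `max` | | |
| Greater than | `gt` | | |
| Greater or equal than | `ge` | | |
| Lower than | `lt` | | |
| Lower or equal than | `le` | | |
| Equal | `eq` | | |
| Cast (into dest type) | `cast_into` | | N/A |
| Cast (from src type) | `cast_from` | | N/A |
| Ternary operator | `if_then_else` | | |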