GPU acceleration
This guide explains how to update an existing program to leverage GPU acceleration, or how to start a new program that uses the GPU.
TFHE-rs now supports a GPU backend with a CUDA implementation, enabling integer arithmetic operations on encrypted data.
Prerequisites
Importing to your project
To use the TFHE-rs GPU backend in your project, add the following dependency in your Cargo.toml.
If you are using an x86 machine:
tfhe = { version = "0.6.4", features = [ "boolean", "shortint", "integer", "x86_64-unix", "gpu" ] }
If you are using an ARM machine:
tfhe = { version = "0.6.4", features = [ "boolean", "shortint", "integer", "aarch64-unix", "gpu" ] }
For optimal performance when using TFHE-rs, run your code in release mode with the --release flag.
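As a quick way to check which of the two dependency lines applies to the machine you are building on, the sketch below maps a Rust target architecture name to the platform feature listed above. The helpers platform_feature and dependency_line are hypothetical names introduced for illustration; they are not part of TFHE-rs.

```rust
/// Maps a Rust target architecture name to the platform feature used in
/// this guide. Illustrative helper only, not a TFHE-rs API.
fn platform_feature(arch: &str) -> Option<&'static str> {
    match arch {
        "x86_64" => Some("x86_64-unix"),
        "aarch64" => Some("aarch64-unix"),
        // Other architectures are not covered by this guide.
        _ => None,
    }
}

/// Builds the Cargo.toml dependency line shown above for the given
/// architecture, or returns None when the architecture is not listed.
fn dependency_line(arch: &str) -> Option<String> {
    platform_feature(arch).map(|feature| {
        format!(
            "tfhe = {{ version = \"0.6.4\", features = [ \"boolean\", \"shortint\", \"integer\", \"{feature}\", \"gpu\" ] }}"
        )
    })
}
```

Calling dependency_line(std::env::consts::ARCH) returns the line for the machine the program was compiled for.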
Supported platforms
TFHE-rs GPU backend is supported on Linux (x86, aarch64).
| OS | x86 | aarch64 |
|---|---|---|
| Linux | x86_64-unix | aarch64-unix* |
| macOS | Unsupported | Unsupported* |
| Windows | Unsupported | Unsupported |
A first example
Configuring and creating keys.
Compared to the CPU example, the GPU setup differs in the key creation, as detailed in the section Setting the keys.
Here is a full example (combining the client and server parts):
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
use tfhe::prelude::*;

fn main() {
    let config = ConfigBuilder::default().build();

    let client_key = ClientKey::generate(config);
    let compressed_server_key = CompressedServerKey::new(&client_key);

    let gpu_key = compressed_server_key.decompress_to_gpu();

    let clear_a = 27u8;
    let clear_b = 128u8;

    let a = FheUint8::encrypt(clear_a, &client_key);
    let b = FheUint8::encrypt(clear_b, &client_key);

    // Server-side
    set_server_key(gpu_key);
    let result = a + b;

    // Client-side
    let decrypted_result: u8 = result.decrypt(&client_key);

    let clear_result = clear_a + clear_b;

    assert_eq!(decrypted_result, clear_result);
}
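The clear-text check at the end of the example works because 27 + 128 = 155 fits in a u8. With other inputs, clear_a + clear_b can overflow and panic in a debug build before the comparison is reached. The sketch below computes the reference value with wrapping arithmetic instead. It assumes that encrypted FheUint8 addition wraps modulo 2^8; check that assumption against the TFHE-rs documentation for your version. The helper expected_clear_sum is a hypothetical name, not a TFHE-rs function.

```rust
/// Clear-text reference value for an 8-bit addition.
/// Assumption: the encrypted FheUint8 addition wraps modulo 2^8.
/// Illustrative helper only, not part of TFHE-rs.
fn expected_clear_sum(clear_a: u8, clear_b: u8) -> u8 {
    // wrapping_add never panics, unlike `+` on overflow in debug builds.
    clear_a.wrapping_add(clear_b)
}
```

In the example, the final check would then read assert_eq!(decrypted_result, expected_clear_sum(clear_a, clear_b)).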
Setting the keys
The key configuration differs from the CPU one. More precisely, although both the client and server keys are still generated by the client (which is assumed to run on a CPU), the server key must then be decompressed by the server to convert it into the right format. To do so, the server should call decompress_to_gpu() on the compressed server key.
Once the key is decompressed, operations are written the same way for the CPU and the GPU.
Encryption
On the client side, data is encrypted exactly as it is for the CPU, as shown in the following example:
let clear_a = 27u8;
let clear_b = 128u8;
let a = FheUint8::encrypt(clear_a, &client_key);
let b = FheUint8::encrypt(clear_b, &client_key);
Computation
The server first needs to set up its keys with set_server_key(gpu_key).
Then, homomorphic computations are performed using the same approach as the CPU operations.
//Server-side
set_server_key(gpu_key);
let result = a + b;
//Client-side
let decrypted_result: u8 = result.decrypt(&client_key);
let clear_result = clear_a + clear_b;
assert_eq!(decrypted_result, clear_result);
Decryption
Finally, the client decrypts the results using:
let decrypted_result: u8 = result.decrypt(&client_key);
Improving performance.
TFHE-rs lets you leverage the large number of threads available on a GPU. To maximize the number of GPU threads used, update your configuration accordingly:
let config = ConfigBuilder::with_custom_parameters(PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS, None).build();
Here's the complete example:
use tfhe::{ConfigBuilder, set_server_key, FheUint8, ClientKey, CompressedServerKey};
use tfhe::prelude::*;
use tfhe::shortint::parameters::PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS;

fn main() {
    let config = ConfigBuilder::with_custom_parameters(
        PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS,
        None,
    )
    .build();

    let client_key = ClientKey::generate(config);
    let compressed_server_key = CompressedServerKey::new(&client_key);

    let gpu_key = compressed_server_key.decompress_to_gpu();

    let clear_a = 27u8;
    let clear_b = 128u8;

    let a = FheUint8::encrypt(clear_a, &client_key);
    let b = FheUint8::encrypt(clear_b, &client_key);

    // Server-side
    set_server_key(gpu_key);
    let result = a + b;

    // Client-side
    let decrypted_result: u8 = result.decrypt(&client_key);

    let clear_result = clear_a + clear_b;

    assert_eq!(decrypted_result, clear_result);
}
List of available operations
The GPU backend includes the following operations:
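| name | symbol | Enc/Enc | Enc/Int |
|---|---|---|---|
| Neg | - | ✔️ | N/A |
| Add | + | ✔️ | ✔️ |
| Sub | - | ✔️ | ✔️ |
| Mul | * | ✔️ | ✔️ |
| Div | / | ✖️ | ✖️ |
| Rem | % | ✖️ | ✖️ |
| Not | ! | ✔️ | N/A |
| BitAnd | & | ✔️ | ✔️ |
| BitOr | \| | ✔️ | ✔️ |
| BitXor | ^ | ✔️ | ✔️ |
| Shr | >> | ✔️ | ✔️ |
| Shl | << | ✔️ | ✔️ |
| Rotate right | rotate_right | ✔️ | ✔️ |
| Rotate left | rotate_left | ✔️ | ✔️ |
| Min | min | ✔️ | ✔️ |
| Max | max | ✔️ | ✔️ |
| Greater than | gt | ✔️ | ✔️ |
| Greater or equal than | ge | ✔️ | ✔️ |
| Lower than | lt | ✔️ | ✔️ |
| Lower or equal than | le | ✔️ | ✔️ |
| Equal | eq | ✔️ | ✔️ |
| Cast (into dest type) | cast_into | ✖️ | N/A |
| Cast (from src type) | cast_from | ✖️ | N/A |
| Ternary operator | if_then_else | ✔️ | ✖️ |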
The equivalent signed operations are also available.
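When porting CPU code, it can help to check each operation against the table before switching the server key. The sketch below encodes the table as a lookup. The names Support and gpu_support are hypothetical and not part of the TFHE-rs API, and the data only reflects the table above, so it may not match other versions of the library.

```rust
/// Support status of an operation on the GPU backend, as listed in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Support {
    Yes,
    No,
    NotApplicable,
}

/// Returns (Enc/Enc, Enc/Int) support for an operation, or None if the
/// operation does not appear in the table.
/// Operations are keyed by the table's symbol column, except negation,
/// keyed as "neg" because its symbol "-" is shared with subtraction.
/// Illustrative helper only, not a TFHE-rs function.
fn gpu_support(op: &str) -> Option<(Support, Support)> {
    use Support::{No, NotApplicable, Yes};
    match op {
        // Unary operations have no clear-operand variant.
        "neg" | "!" => Some((Yes, NotApplicable)),
        "+" | "-" | "*" | "&" | "|" | "^" | ">>" | "<<" => Some((Yes, Yes)),
        "rotate_right" | "rotate_left" | "min" | "max" => Some((Yes, Yes)),
        "gt" | "ge" | "lt" | "le" | "eq" => Some((Yes, Yes)),
        // Division and remainder are not available on the GPU backend.
        "/" | "%" => Some((No, No)),
        "cast_into" | "cast_from" => Some((No, NotApplicable)),
        // The ternary operator needs encrypted operands.
        "if_then_else" => Some((Yes, No)),
        _ => None,
    }
}
```

For example, gpu_support("/") reports that division is unavailable in both columns, so code relying on it cannot be moved to the GPU backend as-is.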
Benchmarks
All GPU benchmarks presented here were obtained on a single H100 GPU, and rely on the multithreaded PBS algorithm. The cryptographic parameters PARAM_GPU_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS were used.
The following table shows the performance when the inputs of the benchmarked operation are encrypted:
| Operation \ Size | FheUint7 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
|---|---|---|---|---|---|---|
| Negation (-) | 46 ms | 60 ms | 75 ms | 94 ms | 150 ms | 247 ms |
| Add / Sub (+, -) | 46 ms | 60 ms | 75 ms | 94 ms | 150 ms | 247 ms |
| Mul (*) | 83 ms | 121 ms | 195 ms | 456 ms | 1.35 s | 4.74 s |
| Equal / Not Equal (eq, ne) | 25 ms | 26 ms | 38 ms | 41 ms | 52 ms | 78 ms |
| Comparisons (ge, gt, le, lt) | 46 ms | 60 ms | 74 ms | 90 ms | 109 ms | 153 ms |
| Max / Min (max, min) | 71 ms | 86 ms | 101 ms | 124 ms | 165 ms | 236 ms |
| Bitwise operations (&, \|, ^) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Shifts (<<, >>) | 71 ms | 88 ms | 109 ms | 180 ms | 279 ms | 494 ms |
| Left / Right Rotations (rotate_left, rotate_right) | 71 ms | 88 ms | 109 ms | 180 ms | 279 ms | 494 ms |
The following table shows the performance when the left input of the benchmarked operation is encrypted and the other is a clear scalar of the same size:
| Operation \ Size | FheUint7 | FheUint16 | FheUint32 | FheUint64 | FheUint128 | FheUint256 |
|---|---|---|---|---|---|---|
| Add / Sub (+, -) | 46 ms | 60 ms | 75 ms | 94 ms | 152 ms | 251 ms |
| Mul (*) | 67 ms | 101 ms | 149 ms | 282 ms | 727 ms | 2.11 s |
| Equal / Not Equal (eq, ne) | 26 ms | 27 ms | 27 ms | 41 ms | 45 ms | 57 ms |
| Comparisons (ge, gt, le, lt) | 29 ms | 41 ms | 54 ms | 69 ms | 87 ms | 117 ms |
| Max / Min (max, min) | 53 ms | 65 ms | 81 ms | 102 ms | 142 ms | 200 ms |
| Bitwise operations (&, \|, ^) | 11 ms | 13 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Shifts (<<, >>) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
| Left / Right Rotations (rotate_left, rotate_right) | 11 ms | 12 ms | 13 ms | 15 ms | 23 ms | 32 ms |
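The two tables can be compared directly to estimate how much faster an operation is when its right-hand operand is a clear scalar. For instance, an FheUint64 multiplication takes 456 ms with two encrypted inputs and 282 ms with a scalar, a ratio of about 1.6, and an FheUint64 shift goes from 180 ms to 15 ms, a ratio of 12. The sketch below does this arithmetic on the latency strings as they are written in the tables. The helpers parse_latency_ms and scalar_speedup are hypothetical names used only for this illustration.

```rust
/// Parses a latency written as in the tables ("456 ms" or "1.35 s")
/// into milliseconds. Returns None for any other format.
fn parse_latency_ms(text: &str) -> Option<f64> {
    let (value, unit) = text.trim().split_once(' ')?;
    let value: f64 = value.parse().ok()?;
    match unit {
        "ms" => Some(value),
        "s" => Some(value * 1000.0),
        _ => None,
    }
}

/// Ratio between the latency with two encrypted inputs and the latency
/// with one encrypted input and one clear scalar.
/// A value above 1.0 means the scalar variant is faster.
fn scalar_speedup(enc_enc: &str, enc_scalar: &str) -> Option<f64> {
    Some(parse_latency_ms(enc_enc)? / parse_latency_ms(enc_scalar)?)
}
```

For example, scalar_speedup("4.74 s", "2.11 s") gives the ratio for FheUint256 multiplication, which is roughly 2.2.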