TFHE-rs v1.4 - October 2025

Summary

TFHE-rs v1.4.1 improves performance, adds new cryptographic capabilities, and enhances hardware support across CPU, GPU, and HPU backends.

See full details below:

CPU

Highlights

The CPU backend introduces new APIs for additional security guarantees, extended atomic pattern support, and new encrypted data handling capabilities:

Security: Introduces the `ReRand` feature to ensure security under the sIND-CPAᴰ model.
Extended KS32 AP support: The keyswitch 32 atomic pattern (KS32 AP) now supports compact public key encryption, keyswitching, compression, and noise squashing.
Performance: KS32 AP provides a 10–19% speedup on 64-bit integer operations.
Encrypted data handling: Adds `KVStore`, a hashmap-like store that supports blind updates of encrypted values.
Parameter clarity: Parameter sets are now standardized and exposed as `MetaParameters`.
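
To illustrate what a "blind" update means, here is a minimal plaintext sketch. It is not the TFHE-rs `KVStore` API: the function name `blind_update` and the slice-of-pairs layout are hypothetical, and ordinary integers stand in for ciphertexts. The point it shows is that every slot is visited and rewritten, so the work done does not depend on which key matched.

```rust
/// Plaintext model of a blind key-value update (hypothetical helper,
/// not part of TFHE-rs). In an FHE setting the equality test and the
/// select below would be evaluated on ciphertexts.
pub fn blind_update(
    store: &mut [(u32, u64)],
    target_key: u32,
    f: impl Fn(u64) -> u64,
) -> bool {
    let mut found = false;
    for (key, value) in store.iter_mut() {
        let is_match = *key == target_key;
        // Compute the candidate for every slot, whether or not it matches.
        let candidate = f(*value);
        // All ones on a match, all zeros otherwise.
        let mask = (is_match as u64).wrapping_neg();
        // Branchless select between the candidate and the old value.
        *value = (candidate & mask) | (*value & !mask);
        found |= is_match;
    }
    found
}
```

Calling `blind_update(&mut store, 2, |v| v + 5)` rewrites all slots but changes only the value stored under key 2, and returns whether the key was present.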

New Features

Add MetaParameters
Add multi-bit PBS support to noise squashing
Add noise squashing support for the KS32 AP
Add ciphertext compression support for the KS32 AP
Add compact public key encryption support for the KS32 AP
Add quasi-uniform OPRF over any range for tfhe::integer
Add KVStore for blind encrypted key-value updates
Add flip operation
Add ReRand primitives for sIND-CPAᴰ security
Add XOF keyset
Make FheUint/FheInt/FheBool compatible with AP params for conformance
Add missing safe_deser for ServerKey in the C API
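
As a rough illustration of why an OPRF over an arbitrary range is "quasi-uniform" rather than exactly uniform, the plaintext sketch below maps a value drawn from a power-of-two range onto `[0, upper)` with a multiply-and-shift. This is an assumption about the general technique, not a description of the TFHE-rs implementation, and `map_to_range` and `output_counts` are hypothetical names. When `upper` does not divide `2^bits`, some outputs occur one more time than others, and the bias shrinks as `bits` grows.

```rust
/// Maps `r`, assumed uniform in [0, 2^bits), to [0, upper) with a
/// multiply-and-shift (hypothetical helper, not part of TFHE-rs).
pub fn map_to_range(r: u64, bits: u32, upper: u64) -> u64 {
    assert!((1..=64).contains(&bits), "bits must be in 1..=64");
    assert!(upper >= 1, "upper must be non-zero");
    if bits < 64 {
        assert!(r < (1u64 << bits), "r must fit in `bits` bits");
    }
    // The 128-bit product cannot overflow; the shift keeps floor(r * upper / 2^bits).
    ((u128::from(r) * u128::from(upper)) >> bits) as u64
}

/// Counts how often each output occurs over all 2^bits inputs.
/// Only intended for small `bits`, to make the bias visible.
pub fn output_counts(bits: u32, upper: u64) -> Vec<u64> {
    assert!(bits <= 20, "exhaustive count is only practical for small bits");
    let mut counts = vec![0u64; upper as usize];
    for r in 0..(1u64 << bits) {
        counts[map_to_range(r, bits, upper) as usize] += 1;
    }
    counts
}
```

With 8 random bits and `upper = 3`, `output_counts(8, 3)` gives `[86, 85, 85]`: the outputs are close to uniform but not exactly uniform.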

Improvements

Improve FFT and NTT plan cache locking

Fixes

Set correct degree for noise squashed decompressed ciphertext
Avoid potential overflow for GLWE encryption on 32-bit platforms
Fix NTT plan yielding incorrect results for a class of primes
Fix scalar size check before ZK public key encryption

GPU

The GPU backend receives major performance upgrades, improved PBS techniques, and new compression and benchmarking capabilities:

Performance: All operations see a 2× speedup on H100 GPUs, with certain primitives (multiplication, division, OPRF, ilog2, scalar division and multiplication) reaching 3–10× acceleration.
PBS enhancements: For classical PBS, a new technique called "mean reduction" replaces the previous "drift" technique, keeping the same cryptographic parameters without requiring an additional key.
Noise squashing: Multi-bit noise squashing is introduced, providing up to 4× faster execution compared to classical PBS.
Compression: Adds support for 128-bit compression.
New benchmark: A new GPU benchmark performs AES encryption in counter mode using FHE.
Parameter clarity: Parameter sets are now standardized and exposed as `MetaParameters`.

New Features

Add 128-bit multi-bit PBS for noise squashing
Add 128-bit compression
Add the centered modulus switch technique to reduce noise in the classical PBS
Add AES-128 encryption in counter mode under FHE on GPU (available in the integer API)

Improvements

Create a specialized version of multi-bit PBS using thread block clusters, which gives a 2× speedup on all operations on H100
Improve the multi-GPU communication scheme
Use CUDA mempools to optimize memory reuse
Improve division performance on nodes with 4 GPUs or more: overall division is 4× faster than in the previous release
Improve encrypted random generation (OPRF) performance by implementing it in CUDA/C++ instead of Rust (results in 10× faster OPRF)
Improve ilog2 performance by implementing it in CUDA/C++ instead of Rust
Enable LUT generation with preallocated CPU buffers to avoid some synchronizations with the CPU in comparisons
Add an assertion that the carry part has the correct size in expand
Create the message extract LUT only when needed for carry propagation
Internal refactors to enhance the C++/Rust interface (pass streams and GPU indexes in a struct, pass compression data via a struct)

Fixes

Fix memory leak in multi-GPU calculations
Fix pbs128 multi-GPU bug
Fix some wrong indexes used in `cuda_set_device()`
Fix inconsistent types to avoid overflows
Add missing syncs when releasing scalar ops and returning trivial radix
Fix the decompression function signature in the CUDA backend

HPU

The HPU backend improves overall latency and execution throughput:

Latency reduction: Overall execution latency is reduced across all HPU operations.
Throughput increase: New SIMD operations further increase HPU throughput on a single V80 FPGA.

New Features

Add 400 MHz HPU v2.1 bitstream
Add ERC20_SIMD & ADD_SIMD operations
Add support for servers with multiple V80 boards (only one is used)
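
For readers unfamiliar with the ERC20 workload, the plaintext sketch below shows the usual branchless transfer pattern and a batched form of it. It assumes that ERC20_SIMD applies this kind of transfer to several independent inputs at once; that is an assumption, not something these notes specify. The names `transfer` and `transfer_batch` are hypothetical, and plain integers stand in for encrypted balances.

```rust
/// Plaintext model of a branchless ERC20-style transfer (hypothetical
/// helper). The amount moves only if the sender can cover it; otherwise
/// both balances stay unchanged.
pub fn transfer(from: u64, to: u64, amount: u64) -> (u64, u64) {
    let has_enough = amount <= from;
    // Select without branching: either `amount` or 0.
    let moved = amount * (has_enough as u64);
    (from - moved, to.wrapping_add(moved))
}

/// Applies `transfer` to several independent transfers, one per index,
/// mimicking a batched (SIMD-style) operation.
pub fn transfer_batch(from: &[u64], to: &[u64], amount: &[u64]) -> Vec<(u64, u64)> {
    assert!(
        from.len() == to.len() && to.len() == amount.len(),
        "all inputs must have the same length"
    );
    from.iter()
        .zip(to)
        .zip(amount)
        .map(|((&f, &t), &a)| transfer(f, t, a))
        .collect()
}
```

Each transfer in the batch is independent of the others, which is what makes the workload a natural fit for batched execution.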

Improvements

Improve latency & throughput benchmarks (HLAPI & integer) to cover some new operations and be more stable
Improve scheduling of MUL operation
Slightly reduce software latency for pushing IOps and receiving IOp acknowledgements
In HPU v2.1 bitstream:
- Compiled with Vivado 2025.1
- Improved place & route (especially on reset) to reach 400 MHz
- Increased bandwidth to load BSK & KSK
- Improved accumulator (MMACC) structure to match PBS batch size (12)

Fixes

Stabilize HPU IOp queue
Fix a few operations (ilog2, trail0/1, ovf_mul...)

