TFHE-rs v1.4 - October 2025

Summary


TFHE-rs v1.4 improves performance, adds new cryptographic capabilities, and enhances hardware support across the CPU, GPU, and HPU backends.

See full details below:

CPU


Highlights

The CPU backend introduces new APIs for additional security guarantees, extended atomic pattern support, and new encrypted data handling capabilities:

  • Security: Introduces the `ReRand` feature to ensure security under the sIND-CPAᴰ model.

  • Extended KS32 AP support: The keyswitch 32 atomic pattern (KS32 AP) now supports compact public key encryption, keyswitching, compression, and noise squashing.

  • Performance: KS32 AP provides a 10–19% speedup on 64-bit integer operations.

  • Encrypted data handling: Adds `KVStore`, which lets hashmaps be manipulated blindly in order to update encrypted values.

  • Parameter clarity: Parameter sets are now standardized and exposed as `MetaParameters`.
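
To make the idea of a blind update concrete, here is a plaintext model in Rust. It is not the TFHE-rs `KVStore` API: the `BlindStore` type and its methods are hypothetical names made up for this sketch, and plain integers stand in for ciphertexts. It only illustrates the general principle, assuming a blind update works by visiting every slot and rewriting it through a branch-free select, so that the sequence of operations does not depend on which key matched.

```rust
/// Plaintext stand-in for a store of encrypted values.
/// Hypothetical type for illustration, not the TFHE-rs `KVStore` API.
pub struct BlindStore {
    // (key, value) pairs; in an FHE setting the values would be ciphertexts.
    entries: Vec<(u64, u64)>,
}

impl BlindStore {
    pub fn new(entries: Vec<(u64, u64)>) -> Self {
        Self { entries }
    }

    /// Branch-free select: returns `a` when `cond` is 1 and `b` when it is 0.
    fn select(cond: u64, a: u64, b: u64) -> u64 {
        cond * a + (1 - cond) * b
    }

    /// Rewrites every slot, keeping the old value unless the key matches.
    /// Returns 1 if some key matched and 0 otherwise.
    pub fn update(&mut self, target: u64, new_value: u64) -> u64 {
        let mut found = 0u64;
        for (key, value) in self.entries.iter_mut() {
            let is_match = (*key == target) as u64;
            *value = Self::select(is_match, new_value, *value);
            found |= is_match;
        }
        found
    }

    /// Scans every slot and returns the value stored under `target`
    /// (0 if the key is absent).
    pub fn get(&self, target: u64) -> u64 {
        let mut out = 0u64;
        for (key, value) in &self.entries {
            let is_match = (*key == target) as u64;
            out = Self::select(is_match, *value, out);
        }
        out
    }
}
```

In this model, `BlindStore::new(vec![(1, 10), (2, 20)]).update(2, 99)` touches both slots and changes only the second. The cost is linear in the number of entries, which is the usual price of hiding the access pattern.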

New Features

  • Add MetaParameters

  • Add multi-bit PBS support to noise squashing

  • Add noise squashing support for the KS32 AP

  • Add ciphertext compression support for the KS32 AP

  • Add compact public key encryption support for the KS32 AP

  • Add quasi-uniform OPRF over any range for tfhe::integer

  • Add KVStore for blind encrypted key-value updates

  • Add flip operation

  • Add ReRand primitives for sIND-CPAᴰ security

  • Add XOF keyset

  • Make FheUint/FheInt/FheBool compatible with AP params for conformance

  • Add missing safe_deser for ServerKey in the C API
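
The term "quasi-uniform" in the OPRF entry can be illustrated with a small plaintext sketch. A fixed number of random bits cannot be mapped perfectly evenly onto a range that is not a power of two, but drawing more bits shrinks the bias. The code below uses a multiply-and-shift mapping, which is a common construction chosen here for illustration; these notes do not state which method TFHE-rs uses, and the function names `map_to_range` and `histogram` are hypothetical.

```rust
/// Maps `random_bits` (a value in `0..2^num_bits`) into `0..range`
/// using multiply-and-shift. Illustrative only.
pub fn map_to_range(random_bits: u64, num_bits: u32, range: u64) -> u64 {
    assert!(num_bits > 0 && num_bits < 64, "num_bits must be in 1..=63");
    assert!(range > 0, "range must be non-zero");
    assert!(random_bits >> num_bits == 0, "random_bits must fit in num_bits");
    // Widen to u128 so the product cannot overflow.
    ((random_bits as u128 * range as u128) >> num_bits) as u64
}

/// Counts how many of the 2^num_bits possible inputs land on each output.
/// Intended for small `num_bits` only, since it enumerates every input.
pub fn histogram(num_bits: u32, range: u64) -> Vec<u64> {
    let mut counts = vec![0u64; range as usize];
    for r in 0..(1u64 << num_bits) {
        counts[map_to_range(r, num_bits, range) as usize] += 1;
    }
    counts
}
```

For a range of 3, `histogram(3, 3)` gives `[3, 3, 2]`, so the outputs have probabilities 3/8, 3/8, and 2/8. With 8 input bits the counts become 86, 85, and 85 out of 256: the per-output counts never differ by more than one, so the relative bias falls as the number of input bits grows.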

Improvements

  • Improve FFT and NTT plan cache locking

Fixes

  • Set correct degree for noise squashed decompressed ciphertext

  • Avoid potential overflow for GLWE encryption on 32-bit platforms

  • Fix NTT plan yielding incorrect results for a class of primes

  • Fix scalar size check before ZK public key encryption

GPU


The GPU backend receives major performance upgrades, improved PBS techniques, and new compression and benchmarking capabilities:

  • Performance: All operations see a 2× speedup on H100 GPUs, with certain primitives (multiplication, division, OPRF, ilog2, scalar division, and scalar multiplication) reaching 3–10× acceleration.

  • PBS enhancements: A new "mean reduction" technique replaces the previous "drift" technique for classical PBS, keeping the same cryptographic parameters without requiring an additional key.

  • Noise squashing: Multi-bit noise squashing is introduced, providing up to 4× faster execution compared to classical PBS.

  • Compression: Adds support for 128-bit compression.

  • New benchmark: A new GPU benchmark performs AES encryption in counter mode using FHE.

  • Parameter clarity: Parameter sets are now standardized and exposed as `MetaParameters`.

New Features

  • Add 128-bit multi-bit PBS for noise squashing

  • Add 128-bit compression

  • Add the centered modulus switch technique to reduce noise in the classical PBS

  • FHE encryption of AES 128 in counter mode on GPU (available in the integer API)
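
For readers unfamiliar with counter mode, the sketch below shows its structure in plain Rust: a block function is applied to successive counter values, and the result is XORed with the data. The `toy_block` function is a placeholder invented for this example. It is not AES and is not secure; in the benchmark, that role is played by AES-128 evaluated homomorphically. The counter layout (a wrapping 128-bit increment of the IV, serialized big-endian) is also an assumption made for illustration.

```rust
/// Placeholder 128-bit block function. NOT AES and not secure;
/// it only stands in for the block cipher in this sketch.
pub fn toy_block(key: u128, block: u128) -> u128 {
    (block ^ key)
        .rotate_left(17)
        .wrapping_mul(0x9E37_79B9_7F4A_7C15_F39C_C060_5CED_C835)
        ^ key
}

/// Counter mode: XORs `data` with the keystream
/// encrypt_block(iv), encrypt_block(iv + 1), ...
/// Applying it twice with the same key and IV restores the input.
pub fn ctr_xor<F: Fn(u128) -> u128>(encrypt_block: F, iv: u128, data: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(data.len());
    for (i, chunk) in data.chunks(16).enumerate() {
        // Each 16-byte block uses its own counter value.
        let keystream = encrypt_block(iv.wrapping_add(i as u128)).to_be_bytes();
        // The last chunk may be shorter than 16 bytes; zip stops at its end.
        out.extend(chunk.iter().zip(keystream.iter()).map(|(d, k)| d ^ k));
    }
    out
}
```

Because each keystream block depends only on the key and its counter value, the blocks can be computed independently of one another, which is what makes counter mode a natural fit for parallel hardware.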

Improvements

  • Create a specialized version of the multi-bit PBS using thread block clusters, which yields a 2× performance improvement on all operations on H100

  • Improve the multi-GPU communication scheme

  • Use CUDA mempools to optimize memory reuse

  • Improve division performance on nodes with 4 GPUs or more: overall division is 4× faster than in the previous release

  • Improve encrypted random generation (OPRF) performance by implementing it in CUDA/C++ instead of Rust (results in 10× faster OPRF)

  • Improve ilog2 performance by implementing it in CUDA/C++ instead of Rust

  • Enable LUT generation with preallocated CPU buffers to avoid some synchronizations with the CPU in comparisons

  • Add an assert to ensure the carry part has the correct size in expand

  • Create the message extract LUT only when needed for carry propagation

  • Internal refactors to enhance the C++/Rust interface (pass streams and GPU indexes in a struct, pass compression data via a struct)

Fixes

  • Fix memory leak in multi-GPU calculations

  • Fix pbs128 multi-GPU bug

  • Fix some wrong indexes used in cuda_set_device()

  • Fix inconsistent types to avoid overflows

  • Add missing syncs when releasing scalar ops and returning trivial radix

  • Fix the decompression function signature in the CUDA backend

HPU


The HPU backend improves overall latency and execution throughput:

  • Latency reduction: Overall execution latency is reduced across all HPU operations.

  • Throughput increase: New SIMD operations further increase HPU throughput on a single V80 FPGA.

New Features

  • Add 400 MHz HPU v2.1 bitstream

  • Add ERC20_SIMD & ADD_SIMD operations

  • Add support for servers with multiple V80 boards (only one is used)

Improvements

  • Improve latency & throughput benches (HLAPI & integer) so that they cover some new operations and are more stable

  • Improve scheduling of the MUL operation

  • Slightly reduce software latency when pushing an IOp and receiving its acknowledgment

  • In HPU v2.1 bitstream:

    • Compiled with Vivado 2025.1

    • Improved place & route (especially on reset) to reach 400 MHz

    • Increased bandwidth to load BSK & KSK

    • Improved accumulator (MMACC) structure to match PBS batch size (12)

Fixes

  • Stabilize HPU IOp queue

  • Fix a few operations (ilog2, trail0/1, ovf_mul...)
