Inference
LLMs can be converted to use FHE to generate encrypted tokens based on encrypted prompts. Concrete ML implements LLM inference as a client/server protocol where:

- The client executes the non-linear layers of the LLM, such as attention and activation functions.
- The server executes the linear layers, such as projection and embedding.
The FHE LLM implementation in Concrete ML has the following characteristics:
Data transfer is necessary for each linear layer. The encrypted data are about 4x the size of the clear data that are inputs/outputs to the linear layers. For instance:

- One model exchanges around 18MB of data per token.
- Another model exchanges around 2.2MB of data per token.
The client machine needs to perform some computation, since it executes the non-linear layers locally as PyTorch layers.
Advantages of this FHE approach include:

- Offloading computation from clients with limited hardware.
- Preserving intellectual property by running sensitive model components on encrypted data.
This document shows how to use Concrete ML to run encrypted LLM inference with FHE.
To prepare an LLM for FHE inference, use the `HybridFHEModel` class:
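Below is a minimal sketch of this step, assuming a GPT-2 model loaded with Hugging Face `transformers`; the `module_names` values are illustrative, and you should instead list the linear layers of your own model:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

from concrete.ml.torch.hybrid_model import HybridFHEModel

# Load a small LLM in the clear; GPT-2 is used here only as an illustration
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Linear sub-modules to offload to the FHE server (illustrative names:
# choose the projection layers of your own model instead)
remote_names = ["transformer.h.0.mlp.c_fc", "transformer.h.0.mlp.c_proj"]

hybrid_model = HybridFHEModel(model, module_names=remote_names)

# Calibrate and compile the remote layers on representative token inputs
calib_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"]
hybrid_model.compile_model(calib_ids, n_bits=8)
```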
After `compile_model` is called as above, you can retrieve the FHE-enabled model in `hybrid_model.model`.
As with all Concrete ML models, you can verify the accuracy of the converted LLM on clear data using `fhe='disable'` or `fhe='simulate'`. To actually execute on encrypted data, set the `fhe_mode` to `execute`:
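A sketch of switching modes, assuming the `set_fhe_mode` helper of `HybridFHEModel` accepts the mode strings listed above:

```python
# Validate accuracy on clear data first (fast, no encryption involved)
hybrid_model.set_fhe_mode("simulate")

# ... compare generated tokens against the original model here ...

# Then switch to real encrypted execution of the remote linear layers
hybrid_model.set_fhe_mode("execute")
```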
Next, to generate some tokens using FHE computation, run:
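For example, using the standard Hugging Face `generate` API on the converted model (the prompt and token count below are illustrative):

```python
prompt = "Programming is"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

# The linear layers selected earlier run under FHE on the server;
# all other layers execute locally in PyTorch on the client
output_ids = hybrid_model.model.generate(input_ids, max_new_tokens=3)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```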
The Concrete ML LLM inference described above can use GPUs for acceleration. Running on GPU reduces latency by roughly 30x: for example, generating a GPT-2 token takes ~11 seconds on GPU, versus ~300 seconds on CPU.