Optimizing inference

This document introduces several approaches to reduce the overall latency of a neural network.

Introduction

Neural networks are challenging for encrypted inference. Each neuron in a network has to apply an activation function that requires a Programmable Bootstrapping (PBS) operation. The latency of a single PBS depends on the bit-width of its input.

Circuit bit-width optimization

Quantization Aware Training and pruning introduce specific hyper-parameters that influence the accumulator sizes. You can choose quantization and pruning configurations to reduce the accumulator size. To obtain a trade-off between latency and accuracy, you can manually set these hyper-parameters as described in the deep learning design guide.
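
The sketch below illustrates how such hyper-parameters might be set on a built-in model. It assumes the skorch-style `module__n_w_bits`, `module__n_a_bits` and `module__n_accum_bits` arguments of `NeuralNetClassifier`; the parameter names and values are indicative only.

```python
# Indicative sketch: constraining accumulator size through quantization and
# pruning hyper-parameters (parameter names assumed, values illustrative).
import torch.nn as nn
from concrete.ml.sklearn import NeuralNetClassifier

model = NeuralNetClassifier(
    module__n_layers=3,
    module__n_w_bits=3,      # weight bit-width: lower values shrink accumulators
    module__n_a_bits=3,      # activation bit-width
    module__n_accum_bits=7,  # target accumulator size, enforced through pruning
    module__activation_function=nn.ReLU,
    max_epochs=10,
)

# model.fit(X_train, y_train) followed by model.compile(X_train) would then
# produce an FHE circuit whose TLU inputs stay within the requested bit-width.
```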

Structured pruning

While using unstructured pruning ensures the accumulator bit-width stays low, structured pruning can eliminate entire neurons from the network, as many neural networks are over-parametrized for easier training. You can apply structured pruning to a trained network as a fine-tuning step. This example demonstrates how to apply structured pruning to built-in neural networks using the prune helper function. To apply structured pruning to custom models, it is recommended to use the torch-pruning package.
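
As a rough illustration of structured pruning on a custom torch model, the sketch below uses PyTorch's built-in `torch.nn.utils.prune.ln_structured`; the torch-pruning package recommended above additionally tracks inter-layer dependencies, which this simple example does not.

```python
# Illustration only: structured pruning with PyTorch's built-in utilities.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Remove 50% of the first layer's neurons (rows of its weight matrix),
# selecting the ones with the smallest L2 norm.
prune.ln_structured(model[0], name="weight", amount=0.5, n=2, dim=0)
prune.remove(model[0], "weight")  # make the pruning permanent

# The pruned model can then be fine-tuned and compiled with Concrete ML.
```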

Rounded activations and quantizers

Reducing the bit-width of inputs to the Table Lookup (TLU) operations significantly improves latency. Post-training, you can leverage properties of the fused activation and quantization functions in the TLUs to further reduce the accumulator size. This is achieved through the rounded PBS feature as described in the rounded activations and quantizers reference. Adjusting the rounding amount relative to the initial accumulator size can improve latency while maintaining accuracy.
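
A minimal sketch of enabling rounding at compile time is shown below; it assumes the `rounding_threshold_bits` argument of `compile_torch_model`, and the model and calibration data names are placeholders.

```python
# Assumed API: compile a torch model with rounded TLU inputs.
from concrete.ml.torch.compile import compile_torch_model

quantized_module = compile_torch_model(
    torch_model,                # a trained torch.nn.Module (placeholder)
    X_calib,                    # representative calibration inputs (placeholder)
    n_bits=4,
    rounding_threshold_bits=6,  # round accumulators to 6 bits before each TLU
)
```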

TLU error tolerance adjustment

Finally, the TFHE scheme introduces a TLU error tolerance parameter that has an impact on crypto-system parameters that influence latency. A higher tolerance of TLU off-by-one errors results in faster computations but may reduce accuracy. You can think of the error of obtaining T[x] as a Gaussian distribution centered on x: T[x] is obtained with probability 1 - p_error, while T[x-1] and T[x+1] are obtained with much lower probability, etc. In deep NNs, this type of error can be tolerated up to some point. See the p_error documentation for details and more specifically the API for finding the best p_error.
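
For instance, a higher error tolerance might be requested at compilation as in the hedged sketch below, which assumes the built-in models' `compile` method forwards a `p_error` argument to the compiler; the training data names are placeholders.

```python
# Hedged sketch: trading TLU exactness for latency via p_error.
from concrete.ml.sklearn import NeuralNetClassifier

model = NeuralNetClassifier(module__n_layers=2, max_epochs=10)
model.fit(X_train, y_train)  # X_train, y_train are placeholders

# Allow roughly a 1% chance of an off-by-one TLU result in exchange for
# lighter crypto-system parameters and faster inference.
model.compile(X_train, p_error=0.01)
```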
