Concrete ML
1.0

Optimizing Inference

Neural networks pose unique challenges with regard to encrypted inference. Each neuron in a network applies an activation function that requires a programmable bootstrapping (PBS) operation, and the latency of a single PBS depends on the bit-width of its input.

Several approaches can be used to reduce the overall latency of a neural network.

Circuit bit-width optimization

Quantization Aware Training and pruning introduce hyper-parameters that influence the accumulator size. Choosing quantization and pruning configurations that reduce the accumulator size lowers latency, usually at some cost in accuracy. Varying these hyper-parameters, as described in the deep learning design guide, lets you tune the trade-off between latency and accuracy.
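To make the link between these hyper-parameters and the accumulator size concrete, the sketch below computes a worst-case bound on the accumulator bit-width of a single neuron. It is an illustration only, not how Concrete ML computes bit-widths: the function name, the signed-weight and unsigned-activation ranges, and the example values are all assumptions.

```python
def worst_case_accumulator_bits(n_active_weights: int, n_w_bits: int, n_a_bits: int) -> int:
    """Upper bound on the signed bit-width of one neuron's accumulator.

    Assumptions (illustrative, not taken from Concrete ML):
    - weights are signed integers in [-2**(n_w_bits - 1), 2**(n_w_bits - 1) - 1]
    - activations are unsigned integers in [0, 2**n_a_bits - 1]
    - n_active_weights is the number of non-zero weights left after pruning
    """
    if n_active_weights == 0:
        return 1
    max_weight_magnitude = 2 ** (n_w_bits - 1)
    max_activation = 2 ** n_a_bits - 1
    max_magnitude = n_active_weights * max_weight_magnitude * max_activation
    # Bits needed for the magnitude, plus one sign bit.
    return max_magnitude.bit_length() + 1


if __name__ == "__main__":
    # Fewer active weights (pruning) or fewer bits (quantization) shrink the bound.
    for n_active, n_w, n_a in [(100, 4, 4), (20, 4, 4), (20, 2, 2)]:
        bits = worst_case_accumulator_bits(n_active, n_w, n_a)
        print(f"active={n_active} w_bits={n_w} a_bits={n_a} -> {bits} bits")
```

The bound is pessimistic, since trained weights rarely all reach their maximum magnitude, but it shows why both pruning and lower-precision quantization reduce the accumulator size.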

Structured pruning

While unstructured pruning is used to keep the accumulator bit-width low, structured pruning can eliminate entire neurons from the network. Many neural networks are over-parametrized (since this makes training easier), so some neurons can be removed. Structured pruning is applied to a trained network as a fine-tuning step. For built-in neural networks, use the prune helper function as shown in this example. For custom models, the torch-pruning package is recommended.
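The idea of removing whole neurons can be sketched in plain Python. The example below is not the prune helper and not the torch-pruning package: the function name, the L1-norm ranking criterion, and the weight layout are assumptions chosen to illustrate the principle on a single hidden layer.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def prune_neurons(w_in: Matrix, w_out: Matrix, keep: int) -> Tuple[Matrix, Matrix, List[int]]:
    """Keep the `keep` hidden neurons with the largest incoming L1 weight norm.

    w_in[j]  holds the incoming weights of hidden neuron j.
    w_out[k] holds the weights of output neuron k, one entry per hidden neuron.
    The L1-norm criterion is an illustrative assumption.
    """
    norms = [sum(abs(w) for w in row) for row in w_in]
    # Rank hidden neurons from most to least important.
    ranked = sorted(range(len(w_in)), key=lambda j: norms[j], reverse=True)
    kept = sorted(ranked[:keep])
    # Drop the rows of pruned neurons and the matching columns downstream.
    pruned_w_in = [w_in[j] for j in kept]
    pruned_w_out = [[row[j] for j in kept] for row in w_out]
    return pruned_w_in, pruned_w_out, kept


if __name__ == "__main__":
    w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
    w_out = [[1.0, 2.0, 3.0, 4.0]]
    new_w_in, new_w_out, kept = prune_neurons(w_in, w_out, keep=2)
    print("kept neurons:", kept)
    print("hidden layer shape:", len(new_w_in), "x", len(new_w_in[0]))
```

Each removed neuron eliminates one activation, and therefore one PBS, from the encrypted computation. In practice the pruned network is fine-tuned afterwards to recover accuracy.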

Rounded activations and quantizers

Reducing the bit-width of the inputs to the Table Lookup (TLU) operations is a major source of latency improvements. After training, some properties of the fused activation and quantization functions expressed in the TLUs can be leveraged to further reduce the accumulator bit-width. This is achieved through the rounded PBS feature, as described in the rounded activations and quantizers reference. Adjusting the rounding amount, relative to the initial accumulator size, can bring large improvements in latency while maintaining accuracy.
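The effect of rounding on a TLU input can be illustrated with integer arithmetic. The sketch below is a simplified model under stated assumptions, not the Concrete ML implementation: the function name is hypothetical, values are treated as non-negative, and the 8-bit and 4-bit widths are arbitrary.

```python
def round_to_msbs(value: int, accumulator_bits: int, kept_bits: int) -> int:
    """Round a non-negative accumulator value so only its top `kept_bits` carry information.

    Illustrative model: the least significant bits are rounded away, so the
    table lookup only has to distinguish about 2**kept_bits inputs.
    """
    lsbs_to_remove = accumulator_bits - kept_bits
    if lsbs_to_remove <= 0:
        return value
    half = 1 << (lsbs_to_remove - 1)
    # Round to the nearest multiple of 2**lsbs_to_remove. Values near the top
    # of the range can round up to 2**accumulator_bits, which a real
    # implementation has to account for.
    return ((value + half) >> lsbs_to_remove) << lsbs_to_remove


if __name__ == "__main__":
    accumulator_bits, kept_bits = 8, 4
    for value in (37, 200):
        rounded = round_to_msbs(value, accumulator_bits, kept_bits)
        print(f"{value} -> {rounded}")
    worst = max(abs(round_to_msbs(v, accumulator_bits, kept_bits) - v) for v in range(256))
    print("largest rounding error:", worst)
```

The rounding error is bounded by half of the removed range, which is why accuracy can often be preserved when the activation function varies slowly relative to that error.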

TLU error probability adjustment

Finally, the TFHE scheme exposes a TLU error probability parameter that affects the crypto-system parameters, which in turn influence latency. A higher probability of TLU error results in faster computations but may reduce accuracy. One can think of the error in obtaining $T[x]$ as a Gaussian distribution centered on $x$: $T[x]$ is obtained with probability 1 - p_error, while $T[x-1]$ and $T[x+1]$ are obtained with much lower probability, and so on. In deep neural networks, these types of errors can be tolerated up to a point. See the p_error documentation for details, and in particular the usage example of the API for finding the best p_error.
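The error model described above can be simulated in plain Python to build intuition. This is a deliberately simplified sketch, not TFHE behavior: it only ever returns the immediate neighbours of the correct entry, clamps at the table edges, and uses hypothetical function names and example tables.

```python
import random
from typing import Sequence


def noisy_tlu(table: Sequence[int], x: int, p_error: float, rng: random.Random) -> int:
    """Return table[x], or a neighbouring entry with probability p_error.

    Simplification: only offsets of -1 or +1 are modelled, and indices are
    clamped to the table range.
    """
    if rng.random() < p_error:
        offset = rng.choice((-1, 1))
        x = min(max(x + offset, 0), len(table) - 1)
    return table[x]


def mismatch_rate(table: Sequence[int], p_error: float, trials: int, seed: int) -> float:
    """Fraction of lookups whose output differs from the exact table value."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(trials):
        x = rng.randrange(len(table))
        if noisy_tlu(table, x, p_error, rng) != table[x]:
            mismatches += 1
    return mismatches / trials


IDENTITY_TABLE = list(range(16))
# ReLU-like table: flat at zero for the first half of the input range.
RELU_TABLE = [max(0, v - 8) for v in range(16)]

if __name__ == "__main__":
    for p_error in (0.0, 0.01, 0.1):
        print(
            f"p_error={p_error}:",
            "identity", mismatch_rate(IDENTITY_TABLE, p_error, 2000, seed=0),
            "relu", mismatch_rate(RELU_TABLE, p_error, 2000, seed=0),
        )
```

In this model, an index error in a flat region of the table does not change the output, so the observed output error rate for the ReLU-like table is no higher than for the identity table. This is one intuition for why such errors can be tolerated to some extent.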


Last updated 2 years ago
