Quantization

From Wikipedia:

Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers).

Why is it needed?

Modern computing has been using data types that are 32 or 64 bits wide for many years, for both integers and floating point values. Even bigger data types are available or can be constructed easily. However, due to the costly nature of FHE computations (see the limits of FHE), using such types with FHE is impractical (or plain impossible) if we are to execute computations in a reasonable amount of time.

The gist of quantization

The basic idea of quantization is to take a range of values that are represented by a large data type and represent them using a single value of a smaller data type. This means that some accuracy in the representation is lost (e.g. a simple approach is to eliminate least-significant bits), but, in many cases in machine learning, it is possible to adapt the models to give meaningful results while using these smaller data types. This significantly reduces the number of bits necessary for intermediary results during the execution of these machine learning models.
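As a purely illustrative sketch (not how Concrete-ML performs quantization internally), the snippet below shows the simple approach mentioned above: dropping least-significant bits so that large integers fit into 8 bits, at the cost of some representation error.

```python
import numpy as np

# Illustrative only: represent large integers with 8 bits by dropping
# the 24 least-significant bits. Some accuracy is lost in the process.
values = np.array([1_000_000, 123_456_789, 2_000_000_000], dtype=np.int64)

coarse = values >> 24          # keep only the 8 most-significant bits
restored = coarse << 24        # map back to the original range

print(coarse)                  # [0, 7, 119]: small integers
print(restored)                # coarse approximations of the original values
print(values - restored)       # the representation error
```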

Quantization in practice

Let's first define some notations. Let $[\alpha, \beta]$ be the range of our value to quantize, where $\alpha$ is the minimum and $\beta$ is the maximum.

To quantize a range with floating point values (in $\mathbb{R}$) to integer values (in $\mathbb{Z}$), we first need to choose the data type that is going to be used. Concrete-Library, the backend library used by Concrete-ML, is currently limited to 7-bit integers, so we'll use this value for the example. Knowing the number of bits that can be used, for a value in the range $[\alpha, \beta]$, we can compute the scale $S$ of the quantization:

$$S = \frac{\beta - \alpha}{2^n - 1}$$

where $n$ is the number of bits (here, 7).

In practice, the quantization scale is then $S = \frac{\beta - \alpha}{127}$. This means the gap between consecutive representable values cannot be smaller than $S$, which, in turn, means there can be a substantial loss of precision. Every interval of length $S$ will be represented by a value within the range $[0..127]$.
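For instance, with an illustrative input range of $[\alpha, \beta] = [-1, 1]$ (an example choice, not a value taken from the text above), the scale would be:

$$S = \frac{1 - (-1)}{2^7 - 1} = \frac{2}{127} \approx 0.0157$$

so the quantized representation cannot distinguish values at a resolution finer than roughly 0.0157.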

The other important parameter in this quantization scheme is the zero point $Z$ value. This essentially brings the 0 floating point value to a specific integer. If the quantization scheme is asymmetric (quantized values are not centered in 0), the resulting integer will be in $\mathbb{Z}$.

$$Z = \mathtt{round}\left(-\frac{\alpha}{S}\right)$$
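The NumPy sketch below puts the scale and zero point together. The helper names (`compute_quantization_params`, `quantize`, `dequantize`) are illustrative and not part of Concrete-ML's API; the formulas follow the definitions above.

```python
import numpy as np

def compute_quantization_params(alpha: float, beta: float, n_bits: int = 7):
    """Scale S and zero point Z for the range [alpha, beta] (illustrative helper)."""
    scale = (beta - alpha) / (2**n_bits - 1)
    zero_point = int(round(-alpha / scale))
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int, n_bits: int = 7):
    """Map floating point values to integers in [0, 2**n_bits - 1]."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 2**n_bits - 1).astype(np.int64)

def dequantize(q: np.ndarray, scale: float, zero_point: int):
    """Map quantized integers back to approximate floating point values."""
    return scale * (q.astype(np.float64) - zero_point)

# Example: quantize values from [-1, 1] to 7 bits
scale, zp = compute_quantization_params(-1.0, 1.0)
x = np.array([-1.0, -0.25, 0.0, 0.5, 1.0])
q = quantize(x, scale, zp)
print(q)                         # integers in [0, 127]
print(dequantize(q, scale, zp))  # close to x, up to the quantization error
```

Note that the dequantized values only approximate the original inputs; the error is on the order of the scale $S$.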

Regarding quantization in Concrete-ML and FHE compilation, it is important to understand the difference between two approaches:

  1. The quantization is done automatically during the model compilation stage (inside our framework). This approach requires little work by the user, but may not be a one-size-fits-all solution for all types of models that a user may want to implement.

  2. The quantization is done by the user, before compilation to FHE; notably, the quantization is completely controlled by the user, and can be done by any means, including by using third-party frameworks. In this approach, the user is responsible for implementing their models directly with NumPy.

For the moment, the first method is applicable through the tools provided in Concrete-ML, and the models implemented in our framework make use of this approach. When quantization is only performed in the compilation stage, the model training stage does not take into account that the model will be quantized. This setting is called Post-Training Quantization (PTQ), and this is the approach currently taken in Concrete-ML. PTQ is effective for moderate bit widths, such as 7-8 bits per weight and activation, but, for a model to be compatible with FHE constraints, we must quantize these values to as few as 2-3 bits. Thus, for models with more than a few neurons per layer, PTQ is not the optimal solution, and we plan to implement a more performant approach called Quantization Aware Training in the near future.

Resources

When using quantized values in a matrix multiplication or convolution, the equations for computing the result are more involved. The IntelLabs distiller quantization documentation provides a more detailed explanation of the mathematics used to quantize values and of how to keep computations consistent.
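As a rough sketch of why the equations become more involved, consider the dot product of two quantized vectors: the scales and zero points of both operands appear in the result. The example below is illustrative NumPy code, not the Distiller or Concrete-ML implementation.

```python
import numpy as np

def quantized_dot(q_x, q_w, s_x, z_x, s_w, z_w):
    """Dot product of two quantized vectors, returning a float result.

    Since x ~ s_x * (q_x - z_x) and w ~ s_w * (q_w - z_w), the dot product
    x . w reduces to an integer-only accumulation, re-scaled at the end.
    """
    acc = np.sum((q_x.astype(np.int64) - z_x) * (q_w.astype(np.int64) - z_w))
    return s_x * s_w * acc

# Toy usage with arbitrary (illustrative) quantization parameters
q_x, s_x, z_x = np.array([3, 70, 127]), 2 / 127, 64
q_w, s_w, z_w = np.array([10, 64, 90]), 1 / 127, 50
print(quantized_dot(q_x, q_w, s_x, z_x, s_w, z_w))
```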

We detail the use of quantization within Concrete-ML here.

IntelLabs distiller explanation of quantization: Distiller documentation

Lei Mao's blog on quantization: Quantization for Neural Networks

Google paper on neural network quantization and integer-only inference: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference