1 of 13

Optimization

This guide explains how to optimize Concrete circuits extensively.

It's split in 3 sections:

: to make circuits utilize more cores.
: to optimize the most expensive operation in Concrete.
: to make Concrete select more performant parameters.

Improve parallelism

This guide introduces the different options for parallelism in Concrete and how to utilize them to improve the execution time of Concrete circuits.

Modern CPUs have multiple cores to perform computation and utilizing multiple cores is a great way to boost performance.

There are two kinds of parallelism in Concrete:

Loop parallelism to make tensor operations parallel, achieved by using
Dataflow parallelism to make independent operations parallel, achieved by using

Loop parallelism is enabled by default, as it's supported on all platforms. Dataflow parallelism however is only supported on Linux, hence not enabled by default.

Dataflow parallelism

This guide explains dataflow parallelism and how it can improve the execution time of Concrete circuits.

Dataflow parallelism is particularly useful when the circuit performs computations that are neither completely independent (such as loop/doall parallelism) nor fully dependent (e.g. sequential, non-parallelizable code). In such cases dataflow tasks can execute as soon as their inputs are available and thus minimizing over-synchronization.

Without dataflow parallelism, circuit is executed operation by operation, like an imperative language. If the operations themselves are not tensorized, loop parallelism would not be utilized and the entire execution would happen in a single thread. Dataflow parallelism changes this by analyzing the operations and their dependencies within the circuit to determine what can be done in parallel and what cannot. Then it distributes the tasks that can be done in parallel to different threads.

For example:

import time

import numpy as np
from concrete import fhe

def f(x, y, z):
    # normally, you'd use fhe.array to construct a concrete tensor
    # but for this example, we just create a simple numpy array
    # so the matrix multiplication can happen on a cellular level
    a = np.array([[x, y], [z, 2]])
    b = np.array([[1, x], [z, y]])
    return fhe.array(a @ b)

inputset = fhe.inputset(fhe.uint3, fhe.uint3, fhe.uint3)

for dataflow_parallelize in [False, True]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset, dataflow_parallelize=dataflow_parallelize)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not dataflow_parallelize:
        print(f"without dataflow parallelize -> {np.mean(timings):.03f}s")
    else:
        print(f"   with dataflow parallelize -> {np.mean(timings):.03f}s")

This prints:

without dataflow parallelize -> 0.609s
   with dataflow parallelize -> 0.414s

The reason for that is:

// this is the generated MLIR for the circuit
// without dataflow, every single line would be executed one after the other

module {
  func.func @main(%arg0: !FHE.eint<7>, %arg1: !FHE.eint<7>, %arg2: !FHE.eint<7>) -> tensor<2x2x!FHE.eint<7>> {
  
    // but if you look closely, you can see that this multiplication
    %c1_i2 = arith.constant 1 : i2
    %0 = "FHE.mul_eint_int"(%arg0, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    
    // is completely independent of this one, so dataflow makes them run in parallel
    %1 = "FHE.mul_eint"(%arg1, %arg2) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // however, this addition needs the first two operations
    // so dataflow waits until both are done before performing this one
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // lastly, this multiplication is completely independent from the first three operations
    // so its execution starts in parallel when execution starts with dataflow
    %3 = "FHE.mul_eint"(%arg0, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // similar logic can be applied to the remaining operations...
    %4 = "FHE.mul_eint"(%arg1, %arg1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %5 = "FHE.add_eint"(%3, %4) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %6 = "FHE.mul_eint_int"(%arg2, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    %c2_i3 = arith.constant 2 : i3
    %7 = "FHE.mul_eint_int"(%arg2, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %8 = "FHE.add_eint"(%6, %7) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %9 = "FHE.mul_eint"(%arg2, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %10 = "FHE.mul_eint_int"(%arg1, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %11 = "FHE.add_eint"(%9, %10) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %from_elements = tensor.from_elements %2, %5, %8, %11 : tensor<2x2x!FHE.eint<7>>
    return %from_elements : tensor<2x2x!FHE.eint<7>>
    
  }
}

To summarize, dataflow analyzes the circuit to determine which parts of the circuit can be run at the same time, and tries to run as many operations as possible in parallel.

When the circuit is tensorized, dataflow might slow execution down since the tensor operations already use multiple threads and adding dataflow on top creates congestion in the CPU between the HPX (dataflow parallelism runtime) and OpenMP (loop parallelism runtime). So try both before deciding on whether to use dataflow or not.

Tensorizing operations

This guide explains tensorization and how it can improve the execution time of Concrete circuits.

Tensors should be used instead of scalars when possible to maximize loop parallelism.

For example:

This prints:

Enabling dataflow is kind of letting the runtime do this for you. It'd also help in the specific case.

Optimize table lookups

This guide teaches how costly table lookups are, and how to optimize them to improve the execution time of Concrete circuits.

The most costly operation in Concrete is the table lookup operation, so one of the primary goals of optimizing performance is to reduce the amount of table lookups.

Furthermore, the bit width of the input of the table lookup plays a major role in performance.

The code above prints:

Reducing TLU

This guide teaches how to improve the execution time of Concrete circuits by reducing the amount of table lookups.

Reducing the amount of table lookups is probably the most complicated guide in this section as it's not automated. The idea is to use mathematical properties of operations to reduce the amount of table lookups needed to achieve the result.

One great example is in adding big integers in bitmap representation. Here is the basic implementation:

def add_bitmaps(x, y):
    result = fhe.zeros((N,))
    carry = 0

    addition = x + y
    for i in range(N):
        addition_and_carry = addition[i] + carry
        carry = addition_and_carry >> 1
        result[i] = addition_and_carry % 2

    return result

There are two table lookups within the loop body, one for >> and one for %.

This implementation is not optimal though, since the same output can be achieved with just a single table lookup:

def add_bitmaps(x, y):
    result = fhe.zeros((N,))
    carry = 0

    addition = x + y
    for i in range(N):
        addition_and_carry = addition[i] + carry
        carry = addition_and_carry >> 1
        result[i] = addition_and_carry - (carry * 2)

    return result

It was possible to do this because the original operations had a mathematical equivalence with the optimized operations and optimized operations achieved the same output with less table lookups!

Here is the full code example and some numbers for this optimization:

import numpy as np
from concrete import fhe

N = 32

def add_bitmaps_naive(x, y):
    result = fhe.zeros((N,))
    carry = 0

    addition = x + y
    for i in range(N):
        addition_and_carry = addition[i] + carry
        carry = addition_and_carry >= 2
        result[i] = addition_and_carry % 2

    return result

def add_bitmaps_optimized(x, y):
    result = fhe.zeros((N,))
    carry = 0

    addition = x + y
    for i in range(N):
        addition_and_carry = addition[i] + carry
        carry = addition_and_carry >> 1
        result[i] = addition_and_carry - (carry * 2)

    return result

inputset = fhe.inputset(fhe.tensor[fhe.uint1, N], fhe.tensor[fhe.uint1, N])
for (name, implementation) in [("naive", add_bitmaps_naive), ("optimized", add_bitmaps_optimized)]:
    compiler = fhe.Compiler(implementation, {"x": "encrypted", "y": "encrypted"})
    circuit = compiler.compile(inputset)

    print(
        f"{name:>9} implementation "
        f"-> {int(circuit.programmable_bootstrap_count)} table lookups "
        f"-> {int(circuit.complexity):_} complexity"
    )

prints:

    naive implementation -> 63 table lookups -> 2_427_170_697 complexity
optimized implementation -> 32 table lookups -> 1_224_206_208 complexity

which is almost half the amount of table lookups and ~2x less complexity for the same operation!

Implementation strategies

This guide teaches how to improve the execution time of Concrete circuits by using different conversion strategies for complex operations.

Concrete provides multiple implementation strategies for these complex operations:

The default strategy is the one that doesn't increase the input bit width, even if it's less optimal than the others. If you don't care about the input bit widths (e.g., if the inputs are only used in this operation), you should definitely change the default strategy.

Choosing the correct strategy can lead to big speedups. So if you are not sure which one to use, you can compile with different strategies and compare the complexity.

For example, the following code:

prints:

or:

prints:

As you can see, strategies can affect the performance a lot! So make sure to select the appropriate one for your use case if you want to optimize performance.

Round/truncating

This guide teaches how to improve the execution time of Concrete circuits by using some special operations that reduce the bit width of the input of the table lookup.

There are two extensions which can reduce the bit width of the table lookup input, and , which can improve performance by sacrificing exactness.

For example the following code:

prints:

Approximate mode

This guide teaches how to improve the execution time of Concrete circuits by using approximate mode for rounding.

You can enable approximate mode to gain even more performance when using rounding by sacrificing some more exactness:

import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint10)
for lsbs_to_remove in range(0, 10):
    def f(x):
        return fhe.round_bit_pattern(x, lsbs_to_remove, exactness=fhe.Exactness.APPROXIMATE) // 2

    compiler = fhe.Compiler(f, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    print(f"{lsbs_to_remove=} -> {int(circuit.complexity):>13_} complexity")

prints:

lsbs_to_remove=0 -> 9_134_406_574 complexity
lsbs_to_remove=1 -> 5_548_275_712 complexity
lsbs_to_remove=2 -> 2_430_793_927 complexity
lsbs_to_remove=3 -> 1_058_638_119 complexity
lsbs_to_remove=4 ->   409_952_712 complexity
lsbs_to_remove=5 ->   172_138_947 complexity
lsbs_to_remove=6 ->    99_198_195 complexity
lsbs_to_remove=7 ->    71_644_380 complexity
lsbs_to_remove=8 ->    55_860_516 complexity
lsbs_to_remove=9 ->    50_978_148 complexity

Bit extraction

This guide teaches how to improve the execution time of Concrete circuits by using bit extraction.

Bit extraction is a cheap way to extract certain bits of encrypted values. It can be very useful for improving the performance of circuits.

For example:

import numpy as np
from concrete import fhe

inputset = fhe.inputset(fhe.uint6)
for bit_extraction in [False, True]:
    def is_even(x):
        return (
            x % 2 == 0
            if not bit_extraction
            else 1 - fhe.bits(x)[0]
        )

    compiler = fhe.Compiler(is_even, {"x": "encrypted"})
    circuit = compiler.compile(inputset)

    if not bit_extraction:
        print(f"without bit extraction -> {int(circuit.complexity):>11_} complexity")
    else:
        print(f"   with bit extraction -> {int(circuit.complexity):>11_} complexity")

prints:

without bit extraction -> 230_210_706 complexity
   with bit extraction ->  29_506_014 complexity

That's almost 8x improvement to circuit complexity!

Optimize cryptographic parameters

This guide explains how to help Concrete Optimizer to select more performant parameters to improve the execution time of Concrete circuits.

The idea is to obtain more optimal cryptographic parameters (especially for table lookups) without changing the operations within the circuit.

Error probability

This guide explains how setting p_error configuration option can affect the performance of Concrete circuits.

Adjusting table lookup error probability is discussed extensively in Table lookup exactness section. The idea is to sacrifice exactness to gain performance.

For example:

import numpy as np
from concrete import fhe

def f(x, y):
    return (x // 2) * (y // 3)

inputset = fhe.inputset(fhe.uint4, fhe.uint4)
for p_error in [(1 / 1_000_000), (1 / 100_000), (1 / 10_000), (1 / 1_000), (1 / 100)]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
    circuit = compiler.compile(inputset, p_error=p_error)
    print(f"p_error of {p_error:.6f} -> {int(circuit.complexity):_} complexity")

This prints:

p_error of 0.000001 -> 294_773_524 complexity
p_error of 0.000010 -> 286_577_520 complexity
p_error of 0.000100 -> 275_887_080 complexity
p_error of 0.001000 -> 265_196_640 complexity
p_error of 0.010000 -> 184_144_972 complexity

Composition

This guide explains how to optimize cryptographic parameters by specifying composition when using modules.

When using modules, make sure to specify composition so that the compiler can select more optimal parameters based on how the functions in the module would be used.

For example:

import numpy as np
from concrete import fhe


@fhe.module()
class PowerWithoutComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

without_composition = PowerWithoutComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f"without composition -> {int(without_composition.complexity):>10_} complexity")


@fhe.module()
class PowerWithComposition:
    @fhe.function({"x": "encrypted"})
    def square(x):
        return x ** 2

    @fhe.function({"x": "encrypted"})
    def cube(x):
        return x ** 3

    composition = fhe.Wired(
        [
            fhe.Wire(fhe.Output(square, 0), fhe.Input(cube, 0))
        ]
    )

with_composition = PowerWithComposition.compile(
    {
        "square": fhe.inputset(fhe.uint2),
        "cube": fhe.inputset(fhe.uint4),
    }
)
print(f"   with composition -> {int(with_composition.complexity):>10_} complexity")

This prints:

without composition -> 185_863_835 complexity
   with composition -> 135_871_612 complexity

It means that specifying composition resulted in ~35% improvement to complexity for computing cube(square(x)).

Dataflow parallelism

This guide explains dataflow parallelism and how it can improve the execution time of Concrete circuits.

For example:

import time

import numpy as np
from concrete import fhe

def f(x, y, z):
    # normally, you'd use fhe.array to construct a concrete tensor
    # but for this example, we just create a simple numpy array
    # so the matrix multiplication can happen on a cellular level
    a = np.array([[x, y], [z, 2]])
    b = np.array([[1, x], [z, y]])
    return fhe.array(a @ b)

inputset = fhe.inputset(fhe.uint3, fhe.uint3, fhe.uint3)

for dataflow_parallelize in [False, True]:
    compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted", "z": "encrypted"})
    circuit = compiler.compile(inputset, dataflow_parallelize=dataflow_parallelize)

    circuit.keygen()
    for sample in inputset[:3]:  # warmup
        circuit.encrypt_run_decrypt(*sample)

    timings = []
    for sample in inputset[3:13]:
        start = time.time()
        result = circuit.encrypt_run_decrypt(*sample)
        end = time.time()

        assert np.array_equal(result, f(*sample))
        timings.append(end - start)

    if not dataflow_parallelize:
        print(f"without dataflow parallelize -> {np.mean(timings):.03f}s")
    else:
        print(f"   with dataflow parallelize -> {np.mean(timings):.03f}s")

This prints:

without dataflow parallelize -> 0.609s
   with dataflow parallelize -> 0.414s

The reason for that is:

// this is the generated MLIR for the circuit
// without dataflow, every single line would be executed one after the other

module {
  func.func @main(%arg0: !FHE.eint<7>, %arg1: !FHE.eint<7>, %arg2: !FHE.eint<7>) -> tensor<2x2x!FHE.eint<7>> {
  
    // but if you look closely, you can see that this multiplication
    %c1_i2 = arith.constant 1 : i2
    %0 = "FHE.mul_eint_int"(%arg0, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    
    // is completely independent of this one, so dataflow makes them run in parallel
    %1 = "FHE.mul_eint"(%arg1, %arg2) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // however, this addition needs the first two operations
    // so dataflow waits until both are done before performing this one
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // lastly, this multiplication is completely independent from the first three operations
    // so its execution starts in parallel when execution starts with dataflow
    %3 = "FHE.mul_eint"(%arg0, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    
    // similar logic can be applied to the remaining operations...
    %4 = "FHE.mul_eint"(%arg1, %arg1) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %5 = "FHE.add_eint"(%3, %4) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %6 = "FHE.mul_eint_int"(%arg2, %c1_i2) : (!FHE.eint<7>, i2) -> !FHE.eint<7>
    %c2_i3 = arith.constant 2 : i3
    %7 = "FHE.mul_eint_int"(%arg2, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %8 = "FHE.add_eint"(%6, %7) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %9 = "FHE.mul_eint"(%arg2, %arg0) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %10 = "FHE.mul_eint_int"(%arg1, %c2_i3) : (!FHE.eint<7>, i3) -> !FHE.eint<7>
    %11 = "FHE.add_eint"(%9, %10) : (!FHE.eint<7>, !FHE.eint<7>) -> !FHE.eint<7>
    %from_elements = tensor.from_elements %2, %5, %8, %11 : tensor<2x2x!FHE.eint<7>>
    return %from_elements : tensor<2x2x!FHE.eint<7>>
    
  }
}

To summarize, dataflow analyzes the circuit to determine which parts of the circuit can be run at the same time, and tries to run as many operations as possible in parallel.