1 of 10

Advanced features

Table Lookups advanced

This document details the management of Table Lookups(TLU) within Concrete for advanced usage. For a simpler guide, refer to the Table Lookup Basics.

One of the most common operations in Concrete are Table Lookups (TLUs). All operations except addition, subtraction, multiplication with non-encrypted values, tensor manipulation operations, and a few operations built with those primitive operations (e.g. matmul, conv) are converted to Table Lookups under the hood.

Table Lookups are very flexible. They allow Concrete to support many operations, but they are expensive. The exact cost depends on many variables (hardware used, error probability, etc.), but they are always much more expensive compared to other operations. You should try to avoid them as much as possible. It's not always possible to avoid them completely, but you might remove the number of TLUs or replace some of them with other primitive operations.

Concrete automatically parallelizes TLUs if they are applied to tensors.

Direct table lookup

Concrete provides a LookupTable class to create your own tables and apply them in your circuits.

LookupTables can have any number of elements. Let's call the number of elements N. As long as the lookup variable is within the range [-N, N), the Table Lookup is valid.

If you go outside of this range, you will receive the following error:

IndexError: index 10 is out of bounds for axis 0 with size 6

With scalars.

You can create the lookup table using a list of integers and apply it using indexing:

from concrete import fhe

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = range(4)
circuit = f.compile(inputset)

assert circuit.encrypt_run_decrypt(0) == table[0] == 2
assert circuit.encrypt_run_decrypt(1) == table[1] == -1
assert circuit.encrypt_run_decrypt(2) == table[2] == 3
assert circuit.encrypt_run_decrypt(3) == table[3] == 0

With tensors.

When you apply a table lookup to a tensor, the scalar table lookup is applied to each element of the tensor:

from concrete import fhe
import numpy as np

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = [np.random.randint(0, 4, size=(2, 3)) for _ in range(10)]
circuit = f.compile(inputset)

sample = [
    [0, 1, 3],
    [2, 3, 1],
]
expected_output = [
    [2, -1, 0],
    [3, 0, -1],
]
actual_output = circuit.encrypt_run_decrypt(np.array(sample))

for i in range(2):
    for j in range(3):
        assert actual_output[i][j] == expected_output[i][j] == table[sample[i][j]]

With negative values.

LookupTable mimics array indexing in Python, which means if the lookup variable is negative, the table is looked up from the back:

from concrete import fhe

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[-x]

inputset = range(1, 5)
circuit = f.compile(inputset)

assert circuit.encrypt_run_decrypt(1) == table[-1] == 0
assert circuit.encrypt_run_decrypt(2) == table[-2] == 3
assert circuit.encrypt_run_decrypt(3) == table[-3] == -1
assert circuit.encrypt_run_decrypt(4) == table[-4] == 2

Direct multi-table lookup

If you want to apply a different lookup table to each element of a tensor, you can have a LookupTable of LookupTables:

from concrete import fhe
import numpy as np

squared = fhe.LookupTable([i ** 2 for i in range(4)])
cubed = fhe.LookupTable([i ** 3 for i in range(4)])

table = fhe.LookupTable([
    [squared, cubed],
    [squared, cubed],
    [squared, cubed],
])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = [np.random.randint(0, 4, size=(3, 2)) for _ in range(10)]
circuit = f.compile(inputset)

sample = [
    [0, 1],
    [2, 3],
    [3, 0],
]
expected_output = [
    [0, 1],
    [4, 27],
    [9, 0]
]
actual_output = circuit.encrypt_run_decrypt(np.array(sample))

for i in range(3):
    for j in range(2):
        if j == 0:
            assert actual_output[i][j] == expected_output[i][j] == squared[sample[i][j]]
        else:
            assert actual_output[i][j] == expected_output[i][j] == cubed[sample[i][j]]

In this example, we applied a squared table to the first column and a cubed table to the second column.

Fused table lookup

Concrete tries to fuse some operations into table lookups automatically so that lookup tables don't need to be created manually:

from concrete import fhe
import numpy as np

@fhe.compiler({"x": "encrypted"})
def f(x):
    return (42 * np.sin(x)).astype(np.int64) // 10

inputset = range(8)
circuit = f.compile(inputset)

for x in range(8):
    assert circuit.encrypt_run_decrypt(x) == f(x)

All lookup tables need to be from integers to integers. So, without .astype(np.int64), Concrete will not be able to fuse.

The function is first traced into:

Concrete then fuses appropriate nodes:

Fusing makes the code more readable and easier to modify, so try to utilize it over manual LookupTables as much as possible.

Using automatically created table lookup

We refer the users to this page for explanations about fhe.univariate(function) and fhe.multivariate(function) features, which are convenient ways to use automatically created table lookup.

Table lookup exactness

TLUs are performed with an FHE operation called Programmable Bootstrapping (PBS). PBSs have a certain probability of error: when these errors happen, it results in inaccurate results.

Let's say you have the table:

lut = [0, 1, 4, 9, 16, 25, 36, 49, 64]

And you perform a Table Lookup using 4. The result you should get is lut[4] = 16, but because of the possibility of error, you could get any other value in the table.

The probability of this error can be configured through the p_error and global_p_error configuration options. The difference between these two options is that, p_error is for individual TLUs but global_p_error is for the whole circuit.

If you set p_error to 0.01, for example, it means every TLU in the circuit will have a 99% chance (or more) of being exact. If there is a single TLU in the circuit, it corresponds to global_p_error = 0.01 as well. But if we have 2 TLUs, then global_p_error would be higher: that's 1 - (0.99 * 0.99) ~= 0.02 = 2%.

If you set global_p_error to 0.01, the whole circuit will have at most 1% probability of error, no matter how many Table Lookups are included (which means that p_error will be smaller than 0.01 if there are more than a single TLU).

If you set both of them, both will be satisfied. Essentially, the stricter one will be used.

By default, both p_error and global_p_error are set to None, which results in a global_p_error of 1 / 100_000 being used.

Feel free to play with these configuration options to pick the one best suited for your needs! See How to Configure to learn how you can set a custom p_error and/or global_p_error.

Configuring either of those variables impacts compilation and execution times (compilation, keys generation, circuit execution) and space requirements (size of the keys on disk and in memory). Lower error probabilities result in longer compilation and execution times and larger space requirements.

Table lookup performance

PBSs are very expensive, in terms of computations. Fortunately, it is sometimes possible to replace PBS by rounded PBS, truncate PBS or even approximate PBS. These TLUs have a slightly different semantic, but are very useful in cases like machine learning for more efficiency without drop of accuracy.

Rounding

This document details the concept of rounding, and how it is used in Concrete to make some FHE computations especially faster.

Table lookups have a strict constraint on the number of bits they support. This can be limiting, especially if you don't need exact precision. As well as this, using larger bit-widths leads to slower table lookups.

To overcome these issues, rounded table lookups are introduced. This operation provides a way to round the least significant bits of a large integer and then apply the table lookup on the resulting (smaller) value.

Imagine you have a 5-bit value, but you want to have a 3-bit table lookup. You can call fhe.round_bit_pattern(input, lsbs_to_remove=2) and use the 3-bit value you receive as input to the table lookup.

Let's see how rounding works in practice:

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

original_bit_width = 5
lsbs_to_remove = 2

assert 0 < lsbs_to_remove < original_bit_width

original_values = list(range(2**original_bit_width))
rounded_values = [
    fhe.round_bit_pattern(value, lsbs_to_remove)
    for value in original_values
]

previous_rounded = rounded_values[0]
for original, rounded in zip(original_values, rounded_values):
    if rounded != previous_rounded:
        previous_rounded = rounded
        print()

    original_binary = np.binary_repr(original, width=(original_bit_width + 1))
    rounded_binary = np.binary_repr(rounded, width=(original_bit_width + 1))

    print(
        f"{original:2} = 0b_{original_binary[:-lsbs_to_remove]}[{original_binary[-lsbs_to_remove:]}] "
        f"=> "
        f"0b_{rounded_binary[:-lsbs_to_remove]}[{rounded_binary[-lsbs_to_remove:]}] = {rounded}"
    )

fig = plt.figure()
ax = fig.add_subplot()

plt.plot(original_values, original_values, label="original", color="black")
plt.plot(original_values, rounded_values, label="rounded", color="green")
plt.legend()

ax.set_aspect("equal", adjustable="box")
plt.show()

prints:

 0 = 0b_0000[00] => 0b_0000[00] = 0
 1 = 0b_0000[01] => 0b_0000[00] = 0

 2 = 0b_0000[10] => 0b_0001[00] = 4
 3 = 0b_0000[11] => 0b_0001[00] = 4
 4 = 0b_0001[00] => 0b_0001[00] = 4
 5 = 0b_0001[01] => 0b_0001[00] = 4

 6 = 0b_0001[10] => 0b_0010[00] = 8
 7 = 0b_0001[11] => 0b_0010[00] = 8
 8 = 0b_0010[00] => 0b_0010[00] = 8
 9 = 0b_0010[01] => 0b_0010[00] = 8

10 = 0b_0010[10] => 0b_0011[00] = 12
11 = 0b_0010[11] => 0b_0011[00] = 12
12 = 0b_0011[00] => 0b_0011[00] = 12
13 = 0b_0011[01] => 0b_0011[00] = 12

14 = 0b_0011[10] => 0b_0100[00] = 16
15 = 0b_0011[11] => 0b_0100[00] = 16
16 = 0b_0100[00] => 0b_0100[00] = 16
17 = 0b_0100[01] => 0b_0100[00] = 16

18 = 0b_0100[10] => 0b_0101[00] = 20
19 = 0b_0100[11] => 0b_0101[00] = 20
20 = 0b_0101[00] => 0b_0101[00] = 20
21 = 0b_0101[01] => 0b_0101[00] = 20

22 = 0b_0101[10] => 0b_0110[00] = 24
23 = 0b_0101[11] => 0b_0110[00] = 24
24 = 0b_0110[00] => 0b_0110[00] = 24
25 = 0b_0110[01] => 0b_0110[00] = 24

26 = 0b_0110[10] => 0b_0111[00] = 28
27 = 0b_0110[11] => 0b_0111[00] = 28
28 = 0b_0111[00] => 0b_0111[00] = 28
29 = 0b_0111[01] => 0b_0111[00] = 28

30 = 0b_0111[10] => 0b_1000[00] = 32
31 = 0b_0111[11] => 0b_1000[00] = 32

and displays:

If the rounded number is one of the last 2**(lsbs_to_remove - 1) numbers in the input range [0, 2**original_bit_width), an overflow will happen.

By default, if an overflow is encountered during inputset evaluation, bit-widths will be adjusted accordingly. This results in a loss of speed, but ensures accuracy.

You can turn this overflow protection off (e.g., for performance) by using fhe.round_bit_pattern(..., overflow_protection=False). However, this could lead to unexpected behavior at runtime.

Now, let's see how rounding can be used in FHE.

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    single_precision=False,
    parameter_selection_strategy=fhe.ParameterSelectionStrategy.MULTI,
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for lsbs_to_remove in range(input_bit_width):
    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.round_bit_pattern(x, lsbs_to_remove) ** 2
    
    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()
    
    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)
    
    took = end - start
    
    timings[lsbs_to_remove] = took
    results[lsbs_to_remove] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[0]
for lsbs_to_remove in range(input_bit_width):
    timing = timings[lsbs_to_remove]
    speedup = baseline / timing
    print(f"lsbs_to_remove={lsbs_to_remove} => {speedup:.2f}x speedup")

    axs[lsbs_to_remove].set_title(f"lsbs_to_remove={lsbs_to_remove}")
    axs[lsbs_to_remove].plot(input_range, results[lsbs_to_remove])

plt.show()

prints:

lsbs_to_remove=0 => 1.00x speedup
lsbs_to_remove=1 => 1.20x speedup
lsbs_to_remove=2 => 2.17x speedup
lsbs_to_remove=3 => 3.75x speedup
lsbs_to_remove=4 => 2.64x speedup
lsbs_to_remove=5 => 2.61x speedup

These speed-ups can vary from system to system.

The reason why the speed-up is not increasing with lsbs_to_remove is because the rounding operation itself has a cost: each bit removal is a PBS. Therefore, if a lot of bits are removed, rounding itself could take longer than the bigger TLU which is evaluated afterwards.

and displays:

Feel free to disable overflow protection and see what happens.

Auto Rounders

Rounding is very useful but, in some cases, you don't know how many bits your input contains, so it's not reliable to specify lsbs_to_remove manually. For this reason, the AutoRounder class is introduced.

AutoRounder allows you to set how many of the most significant bits to keep, but they need to be adjusted using an inputset to determine how many of the least significant bits to remove. This can be done manually using fhe.AutoRounder.adjust(function, inputset), or by setting auto_adjust_rounders configuration to True during compilation.

Here is how auto rounders can be used in FHE:

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    single_precision=False,
    parameter_selection_strategy=fhe.ParameterSelectionStrategy.MULTI,
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for target_msbs in reversed(range(1, input_bit_width + 1)):
    rounder = fhe.AutoRounder(target_msbs)

    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.round_bit_pattern(x, rounder) ** 2

    fhe.AutoRounder.adjust(f, inputset=[input_range])

    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()

    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)

    took = end - start

    timings[target_msbs] = took
    results[target_msbs] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[input_bit_width]
for i, target_msbs in enumerate(reversed(range(1, input_bit_width + 1))):
    timing = timings[target_msbs]
    speedup = baseline / timing
    print(f"target_msbs={target_msbs} => {speedup:.2f}x speedup")

    axs[i].set_title(f"target_msbs={target_msbs}")
    axs[i].plot(input_range, results[target_msbs])

plt.show()

prints:

target_msbs=6 => 1.00x speedup
target_msbs=5 => 1.22x speedup
target_msbs=4 => 1.95x speedup
target_msbs=3 => 3.11x speedup
target_msbs=2 => 2.23x speedup
target_msbs=1 => 2.34x speedup

and displays:

AutoRounders should be defined outside the function that is being compiled. They are used to store the result of the adjustment process, so they shouldn't be created each time the function is called. Furthermore, each AutoRounder should be used with exactly one round_bit_pattern call.

Exactness

One use of rounding is doing faster computation by ignoring the lower significant bits. For this usage, you can even get faster results if you accept the rounding it-self to be slightly inexact. The speedup is usually around 2x-3x but can be higher for big precision reduction. This also enable higher precisions values that are not possible otherwise.

You can turn on this mode either globally on the configuration:

configuration = fhe.Configuration(
    ...
    rounding_exactness=fhe.Exactness.APPROXIMATE
)

or on/off locally:

v = fhe.round_bit_pattern(v, lsbs_to_remove=2, exactness=fhe.Exactness.APPROXIMATE)
v = fhe.round_bit_pattern(v, lsbs_to_remove=2, exactness=fhe.Exactness.EXACT)

In approximate mode the rounding threshold up or down is not perfectly centered: The off-centering is:

is bounded, i.e. at worst an off-by-one on the reduced precision value compared to the exact result,
is pseudo-random, i.e. it will be different on each call,
almost symmetrically distributed,
depends on cryptographic properties like the encryption mask, the encryption noise and the crypto-parameters.

Approximate rounding features

With approximate rounding, you can enable an approximate clipping to get further improve performance in the case of overflow handling. Approximate clipping enable to discard the extra bit of overflow protection bit in the successor TLU. For consistency a logical clipping is available when this optimization is not suitable.

Logical clipping

When fast approximate clipping is not suitable (i.e. slower), it's better to apply logical clipping for consistency and better resilience to code change. It has no extra cost since it's fuzed with the successor TLU.

Approximate clipping

This set the first precision where approximate clipping is enabled, starting from this precision, an extra small precision TLU is introduced to safely remove the extra precision bit used to contain overflow. This way the successor TLU is faster. E.g. for a rounding to 7bits, that finishes to a TLU of 8bits due to overflow, forcing to use a TLU of 7bits is 3x faster.

Truncating

This document details the concept of truncating, and how it is used in Concrete to make some FHE computations especially faster.

To overcome these issues, truncated table lookups are introduced. This operation provides a way to zero the least significant bits of a large integer and then apply the table lookup on the resulting (smaller) value.

Imagine you have a 5-bit value, you can use fhe.truncate_bit_pattern(value, lsbs_to_remove=2) to truncate it (here the last 2 bits are discarded). Once truncated, value will remain in 5-bits (e.g., 22 = 0b10110 would be truncated to 20 = 0b10100), and the last 2 bits of it would be zero. Concrete uses this to optimize table lookups on the truncated value, the 5-bit table lookup gets optimized to a 3-bit table lookup, which is much faster!

Let's see how truncation works in practice:

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

original_bit_width = 5
lsbs_to_remove = 2

assert 0 < lsbs_to_remove < original_bit_width

original_values = list(range(2**original_bit_width))
truncated_values = [
    fhe.truncate_bit_pattern(value, lsbs_to_remove)
    for value in original_values
]

previous_truncated = truncated_values[0]
for original, truncated in zip(original_values, truncated_values):
    if truncated != previous_truncated:
        previous_truncated = truncated
        print()

    original_binary = np.binary_repr(original, width=(original_bit_width + 1))
    truncated_binary = np.binary_repr(truncated, width=(original_bit_width + 1))

    print(
        f"{original:2} = 0b_{original_binary[:-lsbs_to_remove]}[{original_binary[-lsbs_to_remove:]}] "
        f"=> "
        f"0b_{truncated_binary[:-lsbs_to_remove]}[{truncated_binary[-lsbs_to_remove:]}] = {truncated}"
    )

fig = plt.figure()
ax = fig.add_subplot()

plt.plot(original_values, original_values, label="original", color="black")
plt.plot(original_values, truncated_values, label="truncated", color="green")
plt.legend()

ax.set_aspect("equal", adjustable="box")
plt.show()

prints:

 0 = 0b_0000[00] => 0b_0000[00] = 0
 1 = 0b_0000[01] => 0b_0000[00] = 0
 2 = 0b_0000[10] => 0b_0000[00] = 0
 3 = 0b_0000[11] => 0b_0000[00] = 0

 4 = 0b_0001[00] => 0b_0001[00] = 4
 5 = 0b_0001[01] => 0b_0001[00] = 4
 6 = 0b_0001[10] => 0b_0001[00] = 4
 7 = 0b_0001[11] => 0b_0001[00] = 4

 8 = 0b_0010[00] => 0b_0010[00] = 8
 9 = 0b_0010[01] => 0b_0010[00] = 8
10 = 0b_0010[10] => 0b_0010[00] = 8
11 = 0b_0010[11] => 0b_0010[00] = 8

12 = 0b_0011[00] => 0b_0011[00] = 12
13 = 0b_0011[01] => 0b_0011[00] = 12
14 = 0b_0011[10] => 0b_0011[00] = 12
15 = 0b_0011[11] => 0b_0011[00] = 12

16 = 0b_0100[00] => 0b_0100[00] = 16
17 = 0b_0100[01] => 0b_0100[00] = 16
18 = 0b_0100[10] => 0b_0100[00] = 16
19 = 0b_0100[11] => 0b_0100[00] = 16

20 = 0b_0101[00] => 0b_0101[00] = 20
21 = 0b_0101[01] => 0b_0101[00] = 20
22 = 0b_0101[10] => 0b_0101[00] = 20
23 = 0b_0101[11] => 0b_0101[00] = 20

24 = 0b_0110[00] => 0b_0110[00] = 24
25 = 0b_0110[01] => 0b_0110[00] = 24
26 = 0b_0110[10] => 0b_0110[00] = 24
27 = 0b_0110[11] => 0b_0110[00] = 24

28 = 0b_0111[00] => 0b_0111[00] = 28
29 = 0b_0111[01] => 0b_0111[00] = 28
30 = 0b_0111[10] => 0b_0111[00] = 28
31 = 0b_0111[11] => 0b_0111[00] = 28

and displays:

Now, let's see how truncating can be used in FHE.

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for lsbs_to_remove in range(input_bit_width):
    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.truncate_bit_pattern(x, lsbs_to_remove) ** 2
    
    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()
    
    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)
    
    took = end - start
    
    timings[lsbs_to_remove] = took
    results[lsbs_to_remove] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[0]
for lsbs_to_remove in range(input_bit_width):
    timing = timings[lsbs_to_remove]
    speedup = baseline / timing
    print(f"lsbs_to_remove={lsbs_to_remove} => {speedup:.2f}x speedup")

    axs[lsbs_to_remove].set_title(f"lsbs_to_remove={lsbs_to_remove}")
    axs[lsbs_to_remove].plot(input_range, results[lsbs_to_remove])

plt.show()

prints:

lsbs_to_remove=0 => 1.00x speedup
lsbs_to_remove=1 => 1.69x speedup
lsbs_to_remove=2 => 3.48x speedup
lsbs_to_remove=3 => 3.06x speedup
lsbs_to_remove=4 => 3.46x speedup
lsbs_to_remove=5 => 3.14x speedup

These speed-ups can vary from system to system.

The reason why the speed-up is not increasing with lsbs_to_remove is because the truncating operation itself has a cost: each bit removal is a PBS. Therefore, if a lot of bits are removed, truncation itself could take longer than the bigger TLU which is evaluated afterwards.

and displays:

Auto Truncators

Truncating is very useful but, in some cases, you don't know how many bits your input contains, so it's not reliable to specify lsbs_to_remove manually. For this reason, the AutoTruncator class is introduced.

AutoTruncator allows you to set how many of the most significant bits to keep, but they need to be adjusted using an inputset to determine how many of the least significant bits to remove. This can be done manually using fhe.AutoTruncator.adjust(function, inputset), or by setting auto_adjust_truncators configuration to True during compilation.

Here is how auto truncators can be used in FHE:

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    single_precision=False,
    parameter_selection_strategy=fhe.ParameterSelectionStrategy.MULTI,
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for target_msbs in reversed(range(1, input_bit_width + 1)):
    truncator = fhe.AutoTruncator(target_msbs)

    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.truncate_bit_pattern(x, lsbs_to_remove=truncator) ** 2

    fhe.AutoTruncator.adjust(f, inputset=[input_range])

    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()

    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)

    took = end - start

    timings[target_msbs] = took
    results[target_msbs] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[input_bit_width]
for i, target_msbs in enumerate(reversed(range(1, input_bit_width + 1))):
    timing = timings[target_msbs]
    speedup = baseline / timing
    print(f"target_msbs={target_msbs} => {speedup:.2f}x speedup")

    axs[i].set_title(f"target_msbs={target_msbs}")
    axs[i].plot(input_range, results[target_msbs])

plt.show()

prints:

target_msbs=6 => 1.00x speedup
target_msbs=5 => 1.80x speedup
target_msbs=4 => 3.47x speedup
target_msbs=3 => 3.02x speedup
target_msbs=2 => 3.38x speedup
target_msbs=1 => 3.37x speedup

and displays:

AutoTruncators should be defined outside the function that is being compiled. They are used to store the result of the adjustment process, so they shouldn't be created each time the function is called. Furthermore, each AutoTruncator should be used with exactly one truncate_bit_pattern call.

Floating points

This document describes how floating points are treated and manipulated in Concrete.

Concrete partly supports floating points. There is no support for floating point inputs or outputs. However, there is support for intermediate values to be floating points (under certain constraints). Also, we note that one can use an equivalent of fixed points in Concrete, as described in our tutorial.

Floating points as intermediate values

Concrete-Compile, which is used for compiling the circuit, doesn't support floating points at all. However, it supports table lookups which take an integer and map it to another integer. The constraints of this operation are that there should be a single integer input, and a single integer output.

As long as your floating point operations comply with those constraints, Concrete automatically converts them to a table lookup operation:

from concrete import fhe
import numpy as np

@fhe.compiler({"x": "encrypted"})
def f(x):
    a = x + 1.5
    b = np.sin(x)
    c = np.around(a + b)
    d = c.astype(np.int64)
    return d

inputset = range(8)
circuit = f.compile(inputset)

for x in range(8):
    assert circuit.encrypt_run_decrypt(x) == f(x)

In the example above, a, b, and c are floating point intermediates. They are used to calculate d, which is an integer with a value dependent upon x, which is also an integer. Concrete detects this and fuses all of these operations into a single table lookup from x to d.

This approach works for a variety of use cases, but it comes up short for others:

from concrete import fhe
import numpy as np

@fhe.compiler({"x": "encrypted", "y": "encrypted"})
def f(x, y):
    a = x + 1.5
    b = np.sin(y)
    c = np.around(a + b)
    d = c.astype(np.int64)
    return d

inputset = [(1, 2), (3, 0), (2, 2), (1, 3)]
circuit = f.compile(inputset)

for x in range(8):
    assert circuit.encrypt_run_decrypt(x) == f(x)

This results in:

RuntimeError: Function you are trying to compile cannot be converted to MLIR

%0 = x                             # EncryptedScalar<uint2>
%1 = 1.5                           # ClearScalar<float64>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ only integer constants are supported
%2 = y                             # EncryptedScalar<uint2>
%3 = add(%0, %1)                   # EncryptedScalar<float64>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ only integer operations are supported
%4 = sin(%2)                       # EncryptedScalar<float64>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ only integer operations are supported
%5 = add(%3, %4)                   # EncryptedScalar<float64>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ only integer operations are supported
%6 = around(%5)                    # EncryptedScalar<float64>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ only integer operations are supported
%7 = astype(%6, dtype=int_)        # EncryptedScalar<uint3>
return %7

The reason for the error is that d no longer depends solely on x; it depends on y as well. Concrete cannot fuse these operations, so it raises an exception instead.

Min/Max operations

This document explains how to compute minimum and maximum between values in Concrete, covering different strategies to make computations faster, depending on the strategy.

Finding the minimum or maximum of two numbers is not a native operation in Concrete, so it needs to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers two different implementations for this.

Chunked

This is the most general implementation that can be used in any situation. The idea is:

Notes

Multiplication with operands aren't allowed to increase the bit-width of the inputs, so they are very expensive as well.
Optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked comparisons result in at least 9 and at most 21 table lookups.
It is used if no other implementation can be used.

Pros

Can be used with any integers.

Cons

Extremely expensive.

Example

produces

Min/Max Trick

This implementation uses the fact that [min,max](x, y) is equal to [min, max](x - y, 0) + y, which is just a subtraction, a table lookup and an addition!

There are two major problems with this implementation though:

subtraction before the TLU requires up to 2 additional bits to avoid overflows (it is 1 in most cases).
subtraction and addition require the same bit-width across operands.

What this means is that if we are comparing uint3 and uint6, we need to convert both of them to uint7 in some way to do the subtraction and proceed with the TLU in 7-bits. There are 2 ways to achieve this behavior.

Requirements

1. fhe.MinMaxStrategy.ONE_TLU_PROMOTED

This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the amount of bits required to store x - y. The idea is:

Pros

It will always result in a single table lookup.

Cons

It will increase the bit-width of both operands and the result, and lock them together across the whole circuit, which can result in significant slowdowns if the result or the operands are used in other costly operations.

Example

produces

2. fhe.MinMaxStrategy.THREE_TLU_CASTED

This strategy will not put any constraint on bit-widths during bit-width assignment. Instead, operands are cast to a bit-width that can store x - y during runtime using table lookups. The idea is:

Notes

It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store x - y.
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store x - y.

Pros

It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.

Cons

If you are not doing anything else with the operands, or doing less costly operations compared to comparison, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.MinMaxStrategy.ONE_TLU_PROMOTED.

Example

produces

Summary

Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.

Different strategies are good for different circuits. If you want the best runtime for your use case, you can compile your circuit with all different comparison strategy preferences, and pick the one with the lowest complexity.

Direct circuits

This document explains the concept of direct circuits in Concrete, which is another way to compile circuit without having to give a proper inputset.

Direct circuits are still experimental. It is very easy to make mistakes (e.g., due to no overflow checks or type coercion) while using direct circuits, so utilize them with care.

For some applications, the data types of inputs, intermediate values, and outputs are known (e.g., for manipulating bytes, you would want to use uint8). Using inputsets to determine bounds in these cases is not necessary, and can even be error-prone. Therefore, another interface for defining such circuits is introduced:

from concrete import fhe

@fhe.circuit({"x": "encrypted"})
def circuit(x: fhe.uint8):
    return x + 42

assert circuit.encrypt_run_decrypt(10) == 52

There are a few differences between direct circuits and traditional circuits:

Remember that the resulting dtype for each operation will be determined by its inputs. This can lead to some unexpected results if you're not careful (e.g., if you do -x where x: fhe.uint8, you won't receive a negative value as the result will be fhe.uint8 as well)
There is no inputset evaluation when using fhe types in .astype(...) calls (e.g., np.sqrt(x).astype(fhe.uint4)), so the bit width of the output cannot be determined.
Specify the resulting data type in univariate extension (e.g., fhe.univariate(function, outputs=fhe.uint4)(x)), for the same reason as above.
Be careful with overflows. With inputset evaluation, you'll get bigger bit widths but no overflows. With direct definition, you must ensure that there aren't any overflows manually.

Let's review a more complicated example to see how direct circuits behave:

from concrete import fhe
import numpy as np

def square(value):
    return value ** 2

@fhe.circuit({"x": "encrypted", "y": "encrypted"})
def circuit(x: fhe.uint8, y: fhe.int2):
    a = x + 10
    b = y + 10

    c = np.sqrt(a).round().astype(fhe.uint4)
    d = fhe.univariate(square, outputs=fhe.uint8)(b)

    return d - c

print(circuit)

This prints:

%0 = x                       # EncryptedScalar<uint8>
%1 = y                       # EncryptedScalar<int2>
%2 = 10                      # ClearScalar<uint4>
%3 = add(%0, %2)             # EncryptedScalar<uint8>
%4 = 10                      # ClearScalar<uint4>
%5 = add(%1, %4)             # EncryptedScalar<int4>
%6 = subgraph(%3)            # EncryptedScalar<uint4>
%7 = square(%5)              # EncryptedScalar<uint8>
%8 = subtract(%7, %6)        # EncryptedScalar<uint8>
return %8

Subgraphs:

    %6 = subgraph(%3):

        %0 = input                         # EncryptedScalar<uint8>
        %1 = sqrt(%0)                      # EncryptedScalar<float64>
        %2 = around(%1, decimals=0)        # EncryptedScalar<float64>
        %3 = astype(%2)                    # EncryptedScalar<uint4>
        return %3

Here is the breakdown of the assigned data types:

%0 is uint8 because it's specified in the definition
%1 is  int2 because it's specified in the definition
%2 is uint4 because it's the constant 10
%3 is uint8 because it's the addition between uint8 and uint4
%4 is uint4 because it's the constant 10
%5 is  int4 because it's the addition between int2 and uint4
%6 is uint4 because it's specified in astype
%7 is uint8 because it's specified in univariate
%8 is uint8 because it's subtraction between uint8 and uint4

As you can see, %8 is subtraction of two unsigned values, and the result is unsigned as well. In the case that c > d, we have an overflow, and this results in undefined behavior.

Tagging

This document explains the concept of tagging, which is a debugging tool to make a link between the user's Python code and the Concrete MLIR circuits. Such a link can be useful when an issue is raised by the compiler on some MLIR, to know which Python code it corresponds to.

When you have big circuits, keeping track of which node corresponds to which part of your code becomes difficult. A tagging system can simplify such situations:

When you compile f with inputset of range(10), you get the following graph:

If you get an error, you'll see exactly where the error occurred (e.g., which layer of the neural network, if you tag layers).

In the future, we plan to use tags for additional features (e.g., to measure performance of tagged regions), so it's a good idea to start utilizing them for big circuits.

Rounding

This document details the concept of rounding, and how it is used in Concrete to make some FHE computations especially faster.

Let's see how rounding works in practice:

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

original_bit_width = 5
lsbs_to_remove = 2

assert 0 < lsbs_to_remove < original_bit_width

original_values = list(range(2**original_bit_width))
rounded_values = [
    fhe.round_bit_pattern(value, lsbs_to_remove)
    for value in original_values
]

previous_rounded = rounded_values[0]
for original, rounded in zip(original_values, rounded_values):
    if rounded != previous_rounded:
        previous_rounded = rounded
        print()

    original_binary = np.binary_repr(original, width=(original_bit_width + 1))
    rounded_binary = np.binary_repr(rounded, width=(original_bit_width + 1))

    print(
        f"{original:2} = 0b_{original_binary[:-lsbs_to_remove]}[{original_binary[-lsbs_to_remove:]}] "
        f"=> "
        f"0b_{rounded_binary[:-lsbs_to_remove]}[{rounded_binary[-lsbs_to_remove:]}] = {rounded}"
    )

fig = plt.figure()
ax = fig.add_subplot()

plt.plot(original_values, original_values, label="original", color="black")
plt.plot(original_values, rounded_values, label="rounded", color="green")
plt.legend()

ax.set_aspect("equal", adjustable="box")
plt.show()

prints:

 0 = 0b_0000[00] => 0b_0000[00] = 0
 1 = 0b_0000[01] => 0b_0000[00] = 0

 2 = 0b_0000[10] => 0b_0001[00] = 4
 3 = 0b_0000[11] => 0b_0001[00] = 4
 4 = 0b_0001[00] => 0b_0001[00] = 4
 5 = 0b_0001[01] => 0b_0001[00] = 4

 6 = 0b_0001[10] => 0b_0010[00] = 8
 7 = 0b_0001[11] => 0b_0010[00] = 8
 8 = 0b_0010[00] => 0b_0010[00] = 8
 9 = 0b_0010[01] => 0b_0010[00] = 8

10 = 0b_0010[10] => 0b_0011[00] = 12
11 = 0b_0010[11] => 0b_0011[00] = 12
12 = 0b_0011[00] => 0b_0011[00] = 12
13 = 0b_0011[01] => 0b_0011[00] = 12

14 = 0b_0011[10] => 0b_0100[00] = 16
15 = 0b_0011[11] => 0b_0100[00] = 16
16 = 0b_0100[00] => 0b_0100[00] = 16
17 = 0b_0100[01] => 0b_0100[00] = 16

18 = 0b_0100[10] => 0b_0101[00] = 20
19 = 0b_0100[11] => 0b_0101[00] = 20
20 = 0b_0101[00] => 0b_0101[00] = 20
21 = 0b_0101[01] => 0b_0101[00] = 20

22 = 0b_0101[10] => 0b_0110[00] = 24
23 = 0b_0101[11] => 0b_0110[00] = 24
24 = 0b_0110[00] => 0b_0110[00] = 24
25 = 0b_0110[01] => 0b_0110[00] = 24

26 = 0b_0110[10] => 0b_0111[00] = 28
27 = 0b_0110[11] => 0b_0111[00] = 28
28 = 0b_0111[00] => 0b_0111[00] = 28
29 = 0b_0111[01] => 0b_0111[00] = 28

30 = 0b_0111[10] => 0b_1000[00] = 32
31 = 0b_0111[11] => 0b_1000[00] = 32

and displays:

If the rounded number is one of the last 2**(lsbs_to_remove - 1) numbers in the input range [0, 2**original_bit_width), an overflow will happen.

By default, if an overflow is encountered during inputset evaluation, bit-widths will be adjusted accordingly. This results in a loss of speed, but ensures accuracy.

You can turn this overflow protection off (e.g., for performance) by using fhe.round_bit_pattern(..., overflow_protection=False). However, this could lead to unexpected behavior at runtime.

Now, let's see how rounding can be used in FHE.

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    single_precision=False,
    parameter_selection_strategy=fhe.ParameterSelectionStrategy.MULTI,
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for lsbs_to_remove in range(input_bit_width):
    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.round_bit_pattern(x, lsbs_to_remove) ** 2
    
    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()
    
    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)
    
    took = end - start
    
    timings[lsbs_to_remove] = took
    results[lsbs_to_remove] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[0]
for lsbs_to_remove in range(input_bit_width):
    timing = timings[lsbs_to_remove]
    speedup = baseline / timing
    print(f"lsbs_to_remove={lsbs_to_remove} => {speedup:.2f}x speedup")

    axs[lsbs_to_remove].set_title(f"lsbs_to_remove={lsbs_to_remove}")
    axs[lsbs_to_remove].plot(input_range, results[lsbs_to_remove])

plt.show()

prints:

lsbs_to_remove=0 => 1.00x speedup
lsbs_to_remove=1 => 1.20x speedup
lsbs_to_remove=2 => 2.17x speedup
lsbs_to_remove=3 => 3.75x speedup
lsbs_to_remove=4 => 2.64x speedup
lsbs_to_remove=5 => 2.61x speedup

These speed-ups can vary from system to system.

and displays:

Feel free to disable overflow protection and see what happens.

Auto Rounders

Here is how auto rounders can be used in FHE:

import itertools
import time

import matplotlib.pyplot as plt
import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    enable_unsafe_features=True,
    use_insecure_key_cache=True,
    insecure_key_cache_location=".keys",
    single_precision=False,
    parameter_selection_strategy=fhe.ParameterSelectionStrategy.MULTI,
)

input_bit_width = 6
input_range = np.array(range(2**input_bit_width))

timings = {}
results = {}

for target_msbs in reversed(range(1, input_bit_width + 1)):
    rounder = fhe.AutoRounder(target_msbs)

    @fhe.compiler({"x": "encrypted"})
    def f(x):
        return fhe.round_bit_pattern(x, rounder) ** 2

    fhe.AutoRounder.adjust(f, inputset=[input_range])

    circuit = f.compile(inputset=[input_range], configuration=configuration)
    circuit.keygen()

    encrypted_sample = circuit.encrypt(input_range)
    start = time.time()
    encrypted_result = circuit.run(encrypted_sample)
    end = time.time()
    result = circuit.decrypt(encrypted_result)

    took = end - start

    timings[target_msbs] = took
    results[target_msbs] = result

number_of_figures = len(results)

columns = 1
for i in range(2, number_of_figures):
    if number_of_figures % i == 0:
        columns = i
rows = number_of_figures // columns

fig, axs = plt.subplots(rows, columns)
axs = axs.flatten()

baseline = timings[input_bit_width]
for i, target_msbs in enumerate(reversed(range(1, input_bit_width + 1))):
    timing = timings[target_msbs]
    speedup = baseline / timing
    print(f"target_msbs={target_msbs} => {speedup:.2f}x speedup")

    axs[i].set_title(f"target_msbs={target_msbs}")
    axs[i].plot(input_range, results[target_msbs])

plt.show()

prints:

target_msbs=6 => 1.00x speedup
target_msbs=5 => 1.22x speedup
target_msbs=4 => 1.95x speedup
target_msbs=3 => 3.11x speedup
target_msbs=2 => 2.23x speedup
target_msbs=1 => 2.34x speedup

and displays:

Exactness

You can turn on this mode either globally on the configuration:

configuration = fhe.Configuration(
    ...
    rounding_exactness=fhe.Exactness.APPROXIMATE
)

or on/off locally:

v = fhe.round_bit_pattern(v, lsbs_to_remove=2, exactness=fhe.Exactness.APPROXIMATE)
v = fhe.round_bit_pattern(v, lsbs_to_remove=2, exactness=fhe.Exactness.EXACT)

In approximate mode the rounding threshold up or down is not perfectly centered: The off-centering is:

is bounded, i.e. at worst an off-by-one on the reduced precision value compared to the exact result,
is pseudo-random, i.e. it will be different on each call,
almost symmetrically distributed,
depends on cryptographic properties like the encryption mask, the encryption noise and the crypto-parameters.

Approximate rounding features

Logical clipping

Approximate clipping

Min/Max operations

This document explains how to compute minimum and maximum between values in Concrete, covering different strategies to make computations faster, depending on the strategy.

Chunked

This is the most general implementation that can be used in any situation. The idea is:

Notes

Initial comparison is as well, which is already very expensive.
Multiplication with operands aren't allowed to increase the bit-width of the inputs, so they are very expensive as well.
Optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked comparisons result in at least 9 and at most 21 table lookups.
It is used if no other implementation can be used.

Pros

Can be used with any integers.

Cons

Extremely expensive.

Example

import numpy as np
from concrete import fhe

def f(x, y):
    return np.minimum(x, y)

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, show_mlir=True)

produces

module {

  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<4>) -> !FHE.eint<4> {
  
    // calculating select_x, which is x < y since we're computing the minimum
    %cst = arith.constant dense<[0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]> : tensor<16xi64>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst_0) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_1 = arith.constant dense<[0, 1, 1, 1, 2, 0, 1, 1, 2, 2, 0, 1, 2, 2, 2, 0]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %cst_2 = arith.constant dense<[0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12]> : tensor<16xi64>
    %4 = "FHE.apply_lookup_table"(%arg0, %cst_2) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %cst_3 = arith.constant dense<[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]> : tensor<16xi64>
    %5 = "FHE.apply_lookup_table"(%arg1, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %6 = "FHE.add_eint"(%4, %5) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_4 = arith.constant dense<[0, 4, 4, 4, 8, 0, 4, 4, 8, 8, 0, 4, 8, 8, 8, 0]> : tensor<16xi64>
    %7 = "FHE.apply_lookup_table"(%6, %cst_4) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    %8 = "FHE.add_eint"(%7, %3) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_5 = arith.constant dense<[0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]> : tensor<16xi64>
    %9 = "FHE.apply_lookup_table"(%8, %cst_5) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<1>
    
    // extracting the first 2 bits of x shifhted to left by 1 bits for packing
    %cst_6 = arith.constant dense<[0, 2, 4, 6, 0, 2, 4, 6, 0, 2, 4, 6, 0, 2, 4, 6]> : tensor<16xi64>
    %10 = "FHE.apply_lookup_table"(%arg0, %cst_6) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<3>
    
    // casting select_x to 3 bits for packing
    %cst_7 = arith.constant dense<[0, 1]> : tensor<2xi64>
    %11 = "FHE.apply_lookup_table"(%9, %cst_7) : (!FHE.eint<1>, tensor<2xi64>) -> !FHE.eint<3>
    
    // packing the first 2 bits of x with select_x
    %12 = "FHE.add_eint"(%10, %11) : (!FHE.eint<3>, !FHE.eint<3>) -> !FHE.eint<3>
    
    // calculating contribution of 0 if select_x is 0 else the first 2 bits of x
    %cst_8 = arith.constant dense<[0, 0, 0, 1, 0, 2, 0, 3]> : tensor<8xi64>
    %13 = "FHE.apply_lookup_table"(%12, %cst_8) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    
    // extracting the last 2 bits of x shifhted to the left by 1 bit for packing
    %cst_9 = arith.constant dense<[0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6]> : tensor<16xi64>
    %14 = "FHE.apply_lookup_table"(%arg0, %cst_9) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<3>
    
    // packing the last 2 bits of x with select_x
    %15 = "FHE.add_eint"(%14, %11) : (!FHE.eint<3>, !FHE.eint<3>) -> !FHE.eint<3>
    
    // calculating contribution of 0 if select_x is 0 else the last 2 bits of x shifted by 2 bits for direct addition
    %cst_10 = arith.constant dense<[0, 0, 0, 4, 0, 8, 0, 12]> : tensor<8xi64>
    %16 = "FHE.apply_lookup_table"(%15, %cst_10) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    
    // computing x * select_x
    %17 = "FHE.add_eint"(%13, %16) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    
    // extracting the first 2 bits of y shifhted to the left by 1 bit for packing
    %18 = "FHE.apply_lookup_table"(%arg1, %cst_6) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<3>
    
    // packing the first 2 bits of y with select_x
    %19 = "FHE.add_eint"(%18, %11) : (!FHE.eint<3>, !FHE.eint<3>) -> !FHE.eint<3>
    
    // calculating contribution of 0 if select_x is 1 else the first 2 bits of y
    %cst_11 = arith.constant dense<[0, 0, 1, 0, 2, 0, 3, 0]> : tensor<8xi64>
    %20 = "FHE.apply_lookup_table"(%19, %cst_11) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    
    // extracting the last 2 bits of y shifhted to left by 1 bit for packing
    %21 = "FHE.apply_lookup_table"(%arg1, %cst_9) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<3>
    
    // packing the last 2 bits of y with select_x
    %22 = "FHE.add_eint"(%21, %11) : (!FHE.eint<3>, !FHE.eint<3>) -> !FHE.eint<3>
    
    // calculating contribution of 0 if select_x is 1 else the last 2 bits of y shifted by 2 bits for direct addition
    %cst_12 = arith.constant dense<[0, 0, 4, 0, 8, 0, 12, 0]> : tensor<8xi64>
    %23 = "FHE.apply_lookup_table"(%22, %cst_12) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    
    // computing y * (1 - select_x)
    %24 = "FHE.add_eint"(%20, %23) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    
    // computing the result
    %25 = "FHE.add_eint"(%17, %24) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>

    return %25 : !FHE.eint<4>
    
  }
  
}

Min/Max Trick

This implementation uses the fact that [min,max](x, y) is equal to [min, max](x - y, 0) + y, which is just a subtraction, a table lookup and an addition!

There are two major problems with this implementation though:

subtraction before the TLU requires up to 2 additional bits to avoid overflows (it is 1 in most cases).
subtraction and addition require the same bit-width across operands.

Requirements

(x - y).bit_width <= MAXIMUM_TLU_BIT_WIDTH

1. fhe.MinMaxStrategy.ONE_TLU_PROMOTED

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_promoted_to_uint7 - y_promoted_to_uint7] + y_promoted_to_uint7

Pros

It will always result in a single table lookup.

Cons

It will increase the bit-width of both operands and the result, and lock them together across the whole circuit, which can result in significant slowdowns if the result or the operands are used in other costly operations.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    min_max_strategy_preference=fhe.MinMaxStrategy.ONE_TLU_PROMOTED,
)

def f(x, y):
    return np.minimum(x, y)

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**2))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {

  // promotions          ............         ............
  func.func @main(%arg0: !FHE.eint<5>, %arg1: !FHE.eint<5>) -> !FHE.eint<5> {
  
    // subtraction
    %0 = "FHE.to_signed"(%arg0) : (!FHE.eint<5>) -> !FHE.esint<5>
    %1 = "FHE.to_signed"(%arg1) : (!FHE.eint<5>) -> !FHE.esint<5>
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<5>, !FHE.esint<5>) -> !FHE.esint<5>
    
    // tlu
    %cst = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]> : tensor<32xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst) : (!FHE.esint<5>, tensor<32xi64>) -> !FHE.eint<5>
    
    // addition
    %4 = "FHE.add_eint"(%3, %arg1) : (!FHE.eint<5>, !FHE.eint<5>) -> !FHE.eint<5>
    
    return %4 : !FHE.eint<5>
    
  }
  
}

2. fhe.MinMaxStrategy.THREE_TLU_CASTED

This strategy will not put any constraint on bit-widths during bit-width assignment. Instead, operands are cast to a bit-width that can store x - y during runtime using table lookups. The idea is:

uint3_to_uint7_lut = fhe.LookupTable([...])
x_cast_to_uint7 = uint3_to_uint7_lut[x]

uint6_to_uint7_lut = fhe.LookupTable([...])
y_cast_to_uint7 = uint6_to_uint7_lut[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_cast_to_uint7 - y_cast_to_uint7] + y

Notes

It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store x - y.
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store x - y.

Pros

It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.

Cons

If you are not doing anything else with the operands, or doing less costly operations compared to comparison, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.MinMaxStrategy.ONE_TLU_PROMOTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    min_max_strategy_preference=fhe.MinMaxStrategy.THREE_TLU_CASTED,
)

def f(x, y):
    return np.minimum(x, y)

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**2))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {

  // no promotions
  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<2>) -> !FHE.eint<2> {
  
    // casting x
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.esint<5>
    
    // casting y
    %cst_0 = arith.constant dense<[0, 1, 2, 3]> : tensor<4xi64>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst_0) : (!FHE.eint<2>, tensor<4xi64>) -> !FHE.esint<5>
    
    // subtraction
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<5>, !FHE.esint<5>) -> !FHE.esint<5>
    
    // tlu
    %cst_1 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -16, -15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]> : tensor<32xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.esint<5>, tensor<32xi64>) -> !FHE.eint<2>
    
    // addition
    %4 = "FHE.add_eint"(%3, %arg1) : (!FHE.eint<2>, !FHE.eint<2>) -> !FHE.eint<2>
    
    return %4 : !FHE.eint<2>
    
  }
  
}

Summary

Strategy

Minimum # of TLUs

Maximum # of TLUs

Can increase the bit-width of the inputs

Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.

Table Lookups advanced

This document details the management of Table Lookups(TLU) within Concrete for advanced usage. For a simpler guide, refer to the Table Lookup Basics.

Concrete automatically parallelizes TLUs if they are applied to tensors.

Direct table lookup

Concrete provides a LookupTable class to create your own tables and apply them in your circuits.

LookupTables can have any number of elements. Let's call the number of elements N. As long as the lookup variable is within the range [-N, N), the Table Lookup is valid.

If you go outside of this range, you will receive the following error:

IndexError: index 10 is out of bounds for axis 0 with size 6

With scalars.

You can create the lookup table using a list of integers and apply it using indexing:

from concrete import fhe

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = range(4)
circuit = f.compile(inputset)

assert circuit.encrypt_run_decrypt(0) == table[0] == 2
assert circuit.encrypt_run_decrypt(1) == table[1] == -1
assert circuit.encrypt_run_decrypt(2) == table[2] == 3
assert circuit.encrypt_run_decrypt(3) == table[3] == 0

With tensors.

When you apply a table lookup to a tensor, the scalar table lookup is applied to each element of the tensor:

from concrete import fhe
import numpy as np

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = [np.random.randint(0, 4, size=(2, 3)) for _ in range(10)]
circuit = f.compile(inputset)

sample = [
    [0, 1, 3],
    [2, 3, 1],
]
expected_output = [
    [2, -1, 0],
    [3, 0, -1],
]
actual_output = circuit.encrypt_run_decrypt(np.array(sample))

for i in range(2):
    for j in range(3):
        assert actual_output[i][j] == expected_output[i][j] == table[sample[i][j]]

With negative values.

LookupTable mimics array indexing in Python, which means if the lookup variable is negative, the table is looked up from the back:

from concrete import fhe

table = fhe.LookupTable([2, -1, 3, 0])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[-x]

inputset = range(1, 5)
circuit = f.compile(inputset)

assert circuit.encrypt_run_decrypt(1) == table[-1] == 0
assert circuit.encrypt_run_decrypt(2) == table[-2] == 3
assert circuit.encrypt_run_decrypt(3) == table[-3] == -1
assert circuit.encrypt_run_decrypt(4) == table[-4] == 2

Direct multi-table lookup

If you want to apply a different lookup table to each element of a tensor, you can have a LookupTable of LookupTables:

from concrete import fhe
import numpy as np

squared = fhe.LookupTable([i ** 2 for i in range(4)])
cubed = fhe.LookupTable([i ** 3 for i in range(4)])

table = fhe.LookupTable([
    [squared, cubed],
    [squared, cubed],
    [squared, cubed],
])

@fhe.compiler({"x": "encrypted"})
def f(x):
    return table[x]

inputset = [np.random.randint(0, 4, size=(3, 2)) for _ in range(10)]
circuit = f.compile(inputset)

sample = [
    [0, 1],
    [2, 3],
    [3, 0],
]
expected_output = [
    [0, 1],
    [4, 27],
    [9, 0]
]
actual_output = circuit.encrypt_run_decrypt(np.array(sample))

for i in range(3):
    for j in range(2):
        if j == 0:
            assert actual_output[i][j] == expected_output[i][j] == squared[sample[i][j]]
        else:
            assert actual_output[i][j] == expected_output[i][j] == cubed[sample[i][j]]

In this example, we applied a squared table to the first column and a cubed table to the second column.

Fused table lookup

Concrete tries to fuse some operations into table lookups automatically so that lookup tables don't need to be created manually:

from concrete import fhe
import numpy as np

@fhe.compiler({"x": "encrypted"})
def f(x):
    return (42 * np.sin(x)).astype(np.int64) // 10

inputset = range(8)
circuit = f.compile(inputset)

for x in range(8):
    assert circuit.encrypt_run_decrypt(x) == f(x)

All lookup tables need to be from integers to integers. So, without .astype(np.int64), Concrete will not be able to fuse.

The function is first traced into:

Concrete then fuses appropriate nodes:

Fusing makes the code more readable and easier to modify, so try to utilize it over manual LookupTables as much as possible.

Using automatically created table lookup

We refer the users to this page for explanations about fhe.univariate(function) and fhe.multivariate(function) features, which are convenient ways to use automatically created table lookup.

Table lookup exactness

TLUs are performed with an FHE operation called Programmable Bootstrapping (PBS). PBSs have a certain probability of error: when these errors happen, it results in inaccurate results.

Let's say you have the table:

lut = [0, 1, 4, 9, 16, 25, 36, 49, 64]

And you perform a Table Lookup using 4. The result you should get is lut[4] = 16, but because of the possibility of error, you could get any other value in the table.

If you set both of them, both will be satisfied. Essentially, the stricter one will be used.

By default, both p_error and global_p_error are set to None, which results in a global_p_error of 1 / 100_000 being used.

Feel free to play with these configuration options to pick the one best suited for your needs! See How to Configure to learn how you can set a custom p_error and/or global_p_error.

Table lookup performance

Bitwise operations

This document describes how comparisons are managed in Concrete, typically "AND", "OR", and so on. It covers different strategies to make the FHE computations faster, depending on the context.

Bitwise operations are not native operations in Concrete, so they need to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers two different implementations for performing bitwise operations.

Chunked

This is the most general implementation that can be used in any situation. The idea is:

# (example below is for bit-width of 8 and chunk size of 4)

# extract chunks of lhs using table lookups
lhs_chunks = [lhs.bits[0:4], lhs.bits[4:8]]

# extract chunks of rhs using table lookups
rhs_chunks = [rhs.bits[0:4], rhs.bits[4:8]]

# pack chunks of lhs and rhs using clear multiplications and additions 
packed_chunks = []
for lhs_chunk, rhs_chunk in zip(lhs_chunks, rhs_chunks):
    shifted_lhs_chunk = lhs_chunk * 2**4  # (i.e., lhs_chunk << 4)
    packed_chunks.append(shifted_lhs_chunk + rhs_chunk)

# apply comparison table lookup to packed chunks
bitwise_table = fhe.LookupTable([...])
result_chunks = bitwise_table[packed_chunks]

# sum resulting chunks obtain the result
result = np.sum(result_chunks)

Notes

Signed bitwise operations are not supported.
The optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked bitwise operations result in at least 4 and at most 9 table lookups.
It is used if no other implementation can be used.

Pros

Can be used with any integers.

Cons

Very expensive.

Example

import numpy as np
from concrete import fhe

def f(x, y):
    return x & y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, show_mlir=True)

produces

module {
  
  // no promotions
  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<4>) -> !FHE.eint<4> {

    // extracting the first chunk of x, adjusted for shifting
    %cst = arith.constant dense<[0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // extracting the first chunk of y
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]> : tensor<16xi64>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst_0) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // packing the first chunks
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
        
    // applying the bitwise operation to the first chunks, adjusted for addition in the end
    %cst_1 = arith.constant dense<[0, 0, 0, 0, 0, 4, 0, 4, 0, 0, 8, 8, 0, 4, 8, 12]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // extracting the second chunk of x, adjusted for shifting
    %cst_2 = arith.constant dense<[0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12]> : tensor<16xi64>
    %4 = "FHE.apply_lookup_table"(%arg0, %cst_2) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // extracting the second chunk of y
    %cst_3 = arith.constant dense<[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]> : tensor<16xi64>
    %5 = "FHE.apply_lookup_table"(%arg1, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // packing the second chunks
    %6 = "FHE.add_eint"(%4, %5) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
        
    // applying the bitwise operation to second chunks
    %cst_4 = arith.constant dense<[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 2, 0, 1, 2, 3]> : tensor<16xi64>
    %7 = "FHE.apply_lookup_table"(%6, %cst_4) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
        
    // adding resulting chunks to obtain the result
    %8 = "FHE.add_eint"(%7, %3) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
        
    return %8 : !FHE.eint<4>

  }
  
}

Packing Trick

This implementation uses the fact that we can combine two values into a single value and apply a single table lookup to this combined value!

There are two major problems with this implementation:

packing requires the same bit-width across operands.
packing requires the bit-width of at least x.bit_width + y.bit_width and that bit-width cannot exceed maximum TLU bit-width, which is 16 at the moment.

What this means is if we are comparing uint3 and uint6, we need to convert both of them to uint9 in some way to do the packing and proceed with the TLU in 9-bits. There are 4 ways to achieve this behavior.

Requirements

x.bit_width + y.bit_width <= MAXIMUM_TLU_BIT_WIDTH

1. fhe.BitwiseStrategy.ONE_TLU_PROMOTED

bitwise_lut = fhe.LookupTable([...])
result = bitwise_lut[pack(x_promoted_to_uint9, y_promoted_to_uint9)]

Pros

It will always result in a single table lookup.

Cons

It will significantly increase the bit-width of both operands and lock them to each other across the whole circuit, which can result in significant slowdowns if the operands are used in other costly operations.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    bitwise_strategy_preference=fhe.BitwiseStrategy.ONE_TLU_PROMOTED,
)

def f(x, y):
    return x & y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions          ............         ............
  func.func @main(%arg0: !FHE.eint<8>, %arg1: !FHE.eint<8>) -> !FHE.eint<4> {
    
    // packing
    %c16_i9 = arith.constant 16 : i9
    %0 = "FHE.mul_eint_int"(%arg0, %c16_i9) : (!FHE.eint<8>, i9) -> !FHE.eint<8>
    %1 = "FHE.add_eint"(%0, %arg1) : (!FHE.eint<8>, !FHE.eint<8>) -> !FHE.eint<8>
        
    // computing the result
    %cst = arith.constant dense<"..."> : tensor<256xi64>
    %2 = "FHE.apply_lookup_table"(%1, %cst) : (!FHE.eint<8>, tensor<256xi64>) -> !FHE.eint<4>
        
    return %2 : !FHE.eint<4>
        
  }
  
}

2. fhe.BitwiseStrategy.THREE_TLU_CASTED

This strategy will not put any constraint on bit-widths during bit-width assignment, instead operands are cast to a bit-width that can store pack(x, y) during runtime using table lookups. The idea is:

uint3_to_uint9_lut = fhe.LookupTable([...])
x_cast_to_uint9 = uint3_to_uint9_lut[x]

uint6_to_uint9_lut = fhe.LookupTable([...])
y_cast_to_uint9 = uint6_to_uint9_lut[y]

bitwise_lut = fhe.LookupTable([...])
result = bitwise_lut[pack(x_cast_to_uint9, y_cast_to_uint9)]

Notes

It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store pack(x, y).
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store pack(x, y).

Pros

It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.

Cons

If you are not doing anything else with the operands, or doing less costly operations compared to bitwise, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.BitwiseStrategy.ONE_TLU_PROMOTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.BitwiseStrategy.THREE_TLU_CASTED,
)

def f(x, y):
    return x & y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // no promotions
  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<4>) -> !FHE.eint<4> {
    
    // casting
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<8>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<8>

    // packing
    %c16_i9 = arith.constant 16 : i9
    %2 = "FHE.mul_eint_int"(%0, %c16_i9) : (!FHE.eint<8>, i9) -> !FHE.eint<8>
    %3 = "FHE.add_eint"(%2, %1) : (!FHE.eint<8>, !FHE.eint<8>) -> !FHE.eint<8>
        
    // computing the result
    %cst_0 = arith.constant dense<"..."> : tensor<256xi64>
    %4 = "FHE.apply_lookup_table"(%3, %cst_0) : (!FHE.eint<8>, tensor<256xi64>) -> !FHE.eint<4>
        
    return %4 : !FHE.eint<4>
        
  }
  
}

3. fhe.BitwiseStrategy.TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED

This strategy can be viewed as a middle ground between the two strategies described above. With this strategy, only the bigger operand will be constrained to have at least the required bit-width to store pack(x, y), and the smaller operand will be cast to that bit-width during runtime. The idea is:

uint3_to_uint9_lut = fhe.LookupTable([...])
x_cast_to_uint9 = uint3_to_uint9_lut[x]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_cast_to_uint9 - y_promoted_to_uint9]

Notes

It can result in a single table lookup as well, if the smaller operand is assigned (because of other operations) the same bit-width as the bigger operand.

Pros

It will only put a constraint on the bigger operand, which is great if the smaller operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.

Cons

It will significantly increase the bit-width of the bigger operand which can result in significant slowdowns if the bigger operand is used in other costly operations.
If you are not doing anything else with the smaller operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    bitwise_strategy_preference=fhe.BitwiseStrategy.TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED,
)

def f(x, y):
    return x & y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**6))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions                               ............
  func.func @main(%arg0: !FHE.eint<3>, %arg1: !FHE.eint<8>) -> !FHE.eint<3> {
    
    // casting smaller operand
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7]> : tensor<8xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<8>
        
    // packing
    %c32_i9 = arith.constant 32 : i9
    %1 = "FHE.mul_eint_int"(%0, %c32_i9) : (!FHE.eint<8>, i9) -> !FHE.eint<8>
    %2 = "FHE.add_eint"(%1, %arg1) : (!FHE.eint<8>, !FHE.eint<8>) -> !FHE.eint<8>
        
    // computing the result
    %cst_0 = arith.constant dense<"..."> : tensor<256xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.eint<8>, tensor<256xi64>) -> !FHE.eint<3>
        
    return %3 : !FHE.eint<3>
        
  }
  
}

4. fhe.BitwiseStrategy.TWO_TLU_BIGGER_CASTED_SMALLER_PROMOTED

This strategy is like the exact opposite of the strategy above. With this, only the smaller operand will be constrained to have at least the required bit-width, and the bigger operand will be cast during runtime. The idea is:

uint6_to_uint9_lut = fhe.LookupTable([...])
y_cast_to_uint9 = uint6_to_uint9_lut[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_promoted_to_uint9 - y_cast_to_uint9]

Notes

It can result in a single table lookup as well, if the bigger operand is assigned (because of other operations) the same bit-width as the smaller operand.

Pros

It will only put constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.

Cons

It will increase the bit-width of the smaller operand which can result in significant slowdowns if the smaller operand is used in other costly operations.
If you are not doing anything else with the bigger operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    bitwise_strategy_preference=fhe.BitwiseStrategy.TWO_TLU_BIGGER_CASTED_SMALLER_PROMOTED,
)

def f(x, y):
    return x | y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**6))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions          ............
  func.func @main(%arg0: !FHE.eint<9>, %arg1: !FHE.eint<6>) -> !FHE.eint<6> {
    
    // casting bigger operand
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63]> : tensor<64xi64>
    %0 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<9>
        
    // packing
    %c64_i10 = arith.constant 64 : i10
    %1 = "FHE.mul_eint_int"(%arg0, %c64_i10) : (!FHE.eint<9>, i10) -> !FHE.eint<9>
    %2 = "FHE.add_eint"(%1, %0) : (!FHE.eint<9>, !FHE.eint<9>) -> !FHE.eint<9>
        
    // computing the result
    %cst_0 = arith.constant dense<"..."> : tensor<512xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.eint<9>, tensor<512xi64>) -> !FHE.eint<6>
        
    return %3 : !FHE.eint<6>

  }
  
}

Summary

Strategy

Minimum # of TLUs

Maximum # of TLUs

Can increase the bit-width of the inputs

Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.

Shifts

The same configuration option is used to modify the behavior of encrypted shift operations, and shifts are much more complex to implement, so we'll not go over the details. What is important is, the end the result is computed using additions or subtractions on the original shifted operand. Since additions and subtractions require the same bit-width across operands, input and output bit-widths need to be synchronized at some point. There are two ways to do this:

With promotion

Here, the shifted operand and shift result are assigned the same bit-width during bit-width assignment, which avoids an additional TLU on the shifted operand. On the other hand, it might increase the bit-width of the result or the shifted operand, and if they're used in other costly operations, it could result in significant slowdowns. This is the default behavior.

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    shifts_with_promotion=True,
)

def f(x, y):
    return x << y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**2))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions          ............
  func.func @main(%arg0: !FHE.eint<6>, %arg1: !FHE.eint<2>) -> !FHE.eint<6> {
    
    // shifting for the second bit of y
    %cst = arith.constant dense<[0, 0, 1, 1]> : tensor<4xi64>
    %0 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<2>, tensor<4xi64>) -> !FHE.eint<4>
    %cst_0 = arith.constant dense<[0, 0, 0, 2, 2, 2, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]> : tensor<64xi64>
    %1 = "FHE.apply_lookup_table"(%arg0, %cst_0) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %2 = "FHE.add_eint"(%1, %0) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_1 = arith.constant dense<[0, 0, 0, 8, 0, 16, 0, 24, 0, 32, 0, 40, 0, 48, 0, 56]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %cst_2 = arith.constant dense<[0, 6, 12, 2, 8, 14, 4, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]> : tensor<64xi64>
    %4 = "FHE.apply_lookup_table"(%arg0, %cst_2) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %5 = "FHE.add_eint"(%4, %0) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_3 = arith.constant dense<[0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7]> : tensor<16xi64>
    %6 = "FHE.apply_lookup_table"(%5, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %7 = "FHE.add_eint"(%3, %6) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
    %8 = "FHE.add_eint"(%7, %arg0) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
    
    // shifting for the first bit of y
    %cst_4 = arith.constant dense<[0, 1, 0, 1]> : tensor<4xi64>
    %9 = "FHE.apply_lookup_table"(%arg1, %cst_4) : (!FHE.eint<2>, tensor<4xi64>) -> !FHE.eint<4>
    %cst_5 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12, 14, 14, 14, 14, 14, 14, 14, 14]> : tensor<64xi64>
    %10 = "FHE.apply_lookup_table"(%8, %cst_5) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %11 = "FHE.add_eint"(%10, %9) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %12 = "FHE.apply_lookup_table"(%11, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %cst_6 = arith.constant dense<[0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14]> : tensor<64xi64>
    %13 = "FHE.apply_lookup_table"(%8, %cst_6) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %14 = "FHE.add_eint"(%13, %9) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %15 = "FHE.apply_lookup_table"(%14, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %16 = "FHE.add_eint"(%12, %15) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
    %17 = "FHE.add_eint"(%16, %8) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
        
    return %17 : !FHE.eint<6>
        
  }
  
}

With casting

The approach described above could be suboptimal for some circuits, so it is advised to check the complexity with it disabled before production. Here is how the implementation changes with it disabled.

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    shifts_with_promotion=False,
)

def f(x, y):
    return x << y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**2))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // no promotions
  func.func @main(%arg0: !FHE.eint<3>, %arg1: !FHE.eint<2>) -> !FHE.eint<6> {
    
    // shifting for the second bit of y
    %cst = arith.constant dense<[0, 0, 1, 1]> : tensor<4xi64>
    %0 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<2>, tensor<4xi64>) -> !FHE.eint<4>
    %cst_0 = arith.constant dense<[0, 0, 0, 2, 2, 2, 4, 4]> : tensor<8xi64>
    %1 = "FHE.apply_lookup_table"(%arg0, %cst_0) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    %2 = "FHE.add_eint"(%1, %0) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_1 = arith.constant dense<[0, 0, 0, 8, 0, 16, 0, 24, 0, 32, 0, 40, 0, 48, 0, 56]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %cst_2 = arith.constant dense<[0, 6, 12, 2, 8, 14, 4, 10]> : tensor<8xi64>
    %4 = "FHE.apply_lookup_table"(%arg0, %cst_2) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<4>
    %5 = "FHE.add_eint"(%4, %0) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %cst_3 = arith.constant dense<[0, 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7]> : tensor<16xi64>
    %6 = "FHE.apply_lookup_table"(%5, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %7 = "FHE.add_eint"(%3, %6) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
        
    // cast x to 6-bits to compute the result using addition/subtraction
    %cst_4 = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7]> : tensor<8xi64>
    %8 = "FHE.apply_lookup_table"(%arg0, %cst_4) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.eint<6>
    // this was done using promotion instead of casting in runtime when the flag was turned on
        
    %9 = "FHE.add_eint"(%7, %8) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
        
    // shifting for the first bit of y
    %cst_5 = arith.constant dense<[0, 1, 0, 1]> : tensor<4xi64>
    %10 = "FHE.apply_lookup_table"(%arg1, %cst_5) : (!FHE.eint<2>, tensor<4xi64>) -> !FHE.eint<4>
    %cst_6 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12, 14, 14, 14, 14, 14, 14, 14, 14]> : tensor<64xi64>
    %11 = "FHE.apply_lookup_table"(%9, %cst_6) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %12 = "FHE.add_eint"(%11, %10) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %13 = "FHE.apply_lookup_table"(%12, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %cst_7 = arith.constant dense<[0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14]> : tensor<64xi64>
    %14 = "FHE.apply_lookup_table"(%9, %cst_7) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.eint<4>
    %15 = "FHE.add_eint"(%14, %10) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    %16 = "FHE.apply_lookup_table"(%15, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<6>
    %17 = "FHE.add_eint"(%13, %16) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
    %18 = "FHE.add_eint"(%17, %9) : (!FHE.eint<6>, !FHE.eint<6>) -> !FHE.eint<6>
        
    return %18 : !FHE.eint<6>
  }
  
}

Comparisons

This document describes how comparisons are managed in Concrete, typically 'equal', 'greater than', and so on. It covers different strategies to make the FHE computations faster, depending on the context.

Comparisons are not native operations in Concrete, so they need to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers three different implementations for performing comparisons.

Chunked

This is the most general implementation that can be used in any situation. The idea is:

# (example below is for bit-width of 8 and chunk size of 4)

# extract chunks of lhs using table lookups
lhs_chunks = [lhs.bits[0:4], lhs.bits[4:8]]

# extract chunks of rhs using table lookups
rhs_chunks = [rhs.bits[0:4], rhs.bits[4:8]]

# pack chunks of lhs and rhs using clear multiplications and additions 
packed_chunks = []
for lhs_chunk, rhs_chunk in zip(lhs_chunks, rhs_chunks):
    shifted_lhs_chunk = lhs_chunk * 2**4  # (i.e., lhs_chunk << 4)
    packed_chunks.append(shifted_lhs_chunk + rhs_chunk)

# apply comparison table lookup to packed chunks
comparison_table = fhe.LookupTable([...])
chunk_comparisons = comparison_table[packed_chunks]

# reduce chunk comparisons to comparison of numbers
result = chunk_comparisons[0]
for chunk_comparison in chunk_comparisons[1:]:
    chunk_reduction_table = fhe.LookupTable([...])
    shifted_chunk_comparison= chunk_comparison * 2**2  # (i.e., lhs_chunk << 2)
    result = chunk_reduction_table[result + shifted_chunk_comparison]

Notes

Signed comparisons are more complex to explain, but they are supported!
The optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked comparisons result in at least 5 and at most 13 table lookups.
It is used if no other implementation can be used.
== and != are using a different chunk comparison and reduction strategy with less table lookups.

Pros

Can be used with any integers.

Cons

Very expensive.

Example

import numpy as np
from concrete import fhe

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, show_mlir=True)

produces

module {
  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<4>) -> !FHE.eint<1> {
  
    // extracting the first chunk of x, adjusted for shifting
    %cst = arith.constant dense<[0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // extracting the first chunk of y
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]> : tensor<16xi64>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst_0) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // packing first chunks
    %2 = "FHE.add_eint"(%0, %1) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    
    // comparing first chunks
    %cst_1 = arith.constant dense<[0, 1, 1, 1, 2, 0, 1, 1, 2, 2, 0, 1, 2, 2, 2, 0]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // extracting the second chunk of x, adjusted for shifting
    %cst_2 = arith.constant dense<[0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12]> : tensor<16xi64>
    %4 = "FHE.apply_lookup_table"(%arg0, %cst_2) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // extracting the second chunk of y
    %cst_3 = arith.constant dense<[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]> : tensor<16xi64>
    %5 = "FHE.apply_lookup_table"(%arg1, %cst_3) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // packing second chunks
    %6 = "FHE.add_eint"(%4, %5) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    
    // comparing second chunks
    %cst_4 = arith.constant dense<[0, 4, 4, 4, 8, 0, 4, 4, 8, 8, 0, 4, 8, 8, 8, 0]> : tensor<16xi64>
    %7 = "FHE.apply_lookup_table"(%6, %cst_4) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<4>
    
    // packing comparisons
    %8 = "FHE.add_eint"(%7, %3) : (!FHE.eint<4>, !FHE.eint<4>) -> !FHE.eint<4>
    
    // reducing comparisons to result
    %cst_5 = arith.constant dense<[0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]> : tensor<16xi64>
    %9 = "FHE.apply_lookup_table"(%8, %cst_5) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.eint<1>
    
    return %9 : !FHE.eint<1>
    
  }
}

Subtraction Trick

This implementation uses the fact that x [<,<=,==,!=,>=,>] y is equal to x - y [<,<=,==,!=,>=,>] 0, which is just a subtraction and a table lookup!

There are two major problems with this implementation:

subtraction before the TLU requires up to 2 additional bits to avoid overflows (it is 1 in most cases).
subtraction requires the same bit-width across operands.

What this means is if we are comparing uint3 and uint6, we need to convert both of them to uint7 in some way to do the subtraction and proceed with the TLU in 7-bits. There are 4 ways to achieve this behavior.

Requirements

(x - y).bit_width <= MAXIMUM_TLU_BIT_WIDTH

1. fhe.ComparisonStrategy.ONE_TLU_PROMOTED

This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the number of bits required to store x - y. The idea is:

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_promoted_to_uint7 - y_promoted_to_uint7]

Pros

It will always result in a single table lookup.

Cons

It will increase the bit-width of both operands and lock them to each other across the whole circuit, which can result in significant slowdowns if the operands are used in other costly operations.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.ONE_TLU_PROMOTED,
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  // promotions          ............         ............
  func.func @main(%arg0: !FHE.eint<5>, %arg1: !FHE.eint<5>) -> !FHE.eint<1> {
    
    // subtraction
    %0 = "FHE.to_signed"(%arg0) : (!FHE.eint<5>) -> !FHE.esint<5>
    %1 = "FHE.to_signed"(%arg1) : (!FHE.eint<5>) -> !FHE.esint<5>
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<5>, !FHE.esint<5>) -> !FHE.esint<5>
    
    // computing the result
    %cst = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<32xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst) : (!FHE.esint<5>, tensor<32xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

2. fhe.ComparisonStrategy.THREE_TLU_CASTED

This strategy will not put any constraint on bit-widths during bit-width assignment, instead operands are cast to a bit-width that can store x - y during runtime using table lookups. The idea is:

uint3_to_uint7_lut = fhe.LookupTable([...])
x_cast_to_uint7 = uint3_to_uint7_lut[x]

uint6_to_uint7_lut = fhe.LookupTable([...])
y_cast_to_uint7 = uint6_to_uint7_lut[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_cast_to_uint7 - y_cast_to_uint7]

Notes

It can result in a single table lookup, if x and y are assigned (because of other operations) the same bit-width and that bit-width can store x - y.
Alternatively, two table lookups can be used if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store x - y.

Pros

It will not put any constraints on the bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.

Cons

If you are not doing anything else with the operands, or doing less costly operations compared to comparison, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.ComparisonStrategy.ONE_TLU_PROMOTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.THREE_TLU_CASTED,
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**4), np.random.randint(0, 2**4))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // no promotions
  func.func @main(%arg0: !FHE.eint<3>, %arg1: !FHE.eint<6>) -> !FHE.eint<1> {
    
    // casting
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]> : tensor<16xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.esint<5>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<4>, tensor<16xi64>) -> !FHE.esint<5>
    
    // subtraction
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<5>, !FHE.esint<5>) -> !FHE.esint<5>
    
    // computing the result
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<32xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.esint<5>, tensor<32xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

3. fhe.ComparisonStrategy.TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED

This strategy can be seen as a middle ground between the two strategies described above. With this strategy, only the bigger operand will be constrained to have at least the required bit-width to store x - y, and the smaller operand will be cast to that bit-width during runtime. The idea is:

uint3_to_uint7_lut = fhe.LookupTable([...])
x_cast_to_uint7 = uint3_to_uint7_lut[x]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_cast_to_uint7 - y_promoted_to_uint7]

Notes

It can result in a single table lookup, if the smaller operand is assigned (because of other operations) the same bit-width as the bigger operand.

Pros

It will only put a constraint on the bigger operand, which is great if the smaller operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.

Cons

It will increase the bit-width of the bigger operand, which can result in significant slowdowns if the bigger operand is used in other costly operations.
If you are not doing anything else with the smaller operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.ComparisonStrategy.THREE_TLU_CASTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED,
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**5))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions                               ............
  func.func @main(%arg0: !FHE.eint<3>, %arg1: !FHE.eint<6>) -> !FHE.eint<1> {
    
    // casting the smaller operand
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7]> : tensor<8xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.esint<6>
    
    // subtraction
    %1 = "FHE.to_signed"(%arg1) : (!FHE.eint<6>) -> !FHE.esint<6>
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<6>, !FHE.esint<6>) -> !FHE.esint<6>
    
    // computing the result
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<64xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.esint<6>, tensor<64xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

4. fhe.ComparisonStrategy.TWO_TLU_BIGGER_CASTED_SMALLER_PROMOTED

This strategy can be seen as the exact opposite of the strategy above. With this, only the smaller operand will be constrained to have at least the required bit-width, and the bigger operand will be cast during runtime. The idea is:

uint6_to_uint7_lut = fhe.LookupTable([...])
y_cast_to_uint7 = uint6_to_uint7_lut[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_promoted_to_uint7 - y_cast_to_uint7]

Notes

It can result in a single table lookup, if the bigger operand is assigned (because of other operations) the same bit-width as the smaller operand.

Pros

It will only put a constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.

Cons

It will increase the bit-width of the smaller operand, which can result in significant slowdowns if the smaller operand is used in other costly operations.
If you are not doing anything else with the bigger operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.ComparisonStrategy.THREE_TLU_CASTED.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED,
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**5))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions          ............
  func.func @main(%arg0: !FHE.eint<6>, %arg1: !FHE.eint<5>) -> !FHE.eint<1> {
    
    // casting the bigger operand
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]> : tensor<32xi64>
    %0 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<5>, tensor<32xi64>) -> !FHE.esint<6>
    
    // subtraction
    %1 = "FHE.to_signed"(%arg0) : (!FHE.eint<6>) -> !FHE.esint<6>
    %2 = "FHE.sub_eint"(%1, %0) : (!FHE.esint<6>, !FHE.esint<6>) -> !FHE.esint<6>
    
    // computing the result
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<64xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.esint<6>, tensor<64xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

Clipping Trick

This implementation uses the fact that the subtraction trick is not optimal in terms of the required intermediate bit width. The comparison result does not change if we compare(3, 40) or compare(3, 4), so why not clipping the bigger operand and then doing the subtraction to use less bits!

There are two major problems with this implementation:

it can not be used when the bit-widths are the same (for some cases even when they differ by only one bit)
subtraction still requires the same bit-width across operands.

What this means is if we are comparing uint3 and uint6, we need to convert both of them to uint4 in some way to do the subtraction and proceed with the TLU in 7-bits. There are 2 ways to achieve this behavior.

Requirements

```
x.bit_width != y.bit_width
```

smaller = x if x.bit_width < y.bit_width else y
bigger = x if x.bit_width > y.bit_width else y
clipped = lambda value: np.clip(value, smaller.min() - 1, smaller.max() + 1)
any(
    (
        bit_width <= MAXIMUM_TLU_BIT_WIDTH and
        bit_width <= bigger.dtype.bit_width and
        bit_width > smaller.dtype.bit_width
    )
    for bit_width in [
        (smaller - clipped(bigger)).bit_width,
        (clipped(bigger) - smaller).bit_width,
    ]
  )

1. fhe.ComparisonStrategy.THREE_TLU_BIGGER_CLIPPED_SMALLER_CASTED

This strategy will not put any constraint on bit-widths during bit-width assignment, instead the smaller operand is cast to a bit-width that can store clipped(bigger) - smaller or smaller - clipped(bigger) during runtime using table lookups. The idea is:

uint3_to_uint4_lut = fhe.LookupTable([...])
x_cast_to_uint4 = uint3_to_uint4_lut[x]

clipper = fhe.LookupTable([...])
y_clipped = clipper[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_cast_to_uint4 - y_clipped]
# or
another_comparison_lut = fhe.LookupTable([...])
result = another_comparison_lut[y_clipped - x_cast_to_uint4]

Notes

This is a fallback implementation, so if there is a difference of 1-bit (or in some cases 2-bits) and the subtraction trick cannot be used optimally, this implementation will be used instead of fhe.ComparisonStrategy.CHUNKED.
It can result in two table lookups if the smaller operand is assigned a bit-width bigger than or equal to the bit width that can store clipped(bigger) - smaller or smaller - clipped(bigger).

Pros

It will not put any constraints on the bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
These table lookups will be on smaller bit-widths, which is great.

Cons

Cannot be used to compare integers with the same bit-width, which is very common.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.THREE_TLU_BIGGER_CLIPPED_SMALLER_CASTED
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**6))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // no promotions
  func.func @main(%arg0: !FHE.eint<3>, %arg1: !FHE.eint<6>) -> !FHE.eint<1> {
    
    // casting the smaller operand 
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7]> : tensor<8xi64>
    %0 = "FHE.apply_lookup_table"(%arg0, %cst) : (!FHE.eint<3>, tensor<8xi64>) -> !FHE.esint<4>
    
    // clipping the bigger operand
    %cst_0 = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]> : tensor<64xi64>
    %1 = "FHE.apply_lookup_table"(%arg1, %cst_0) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.esint<4>
    
    // subtraction
    %2 = "FHE.sub_eint"(%0, %1) : (!FHE.esint<4>, !FHE.esint<4>) -> !FHE.esint<4>
    
    // computing the result
    %cst_1 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_1) : (!FHE.esint<4>, tensor<16xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

2. fhe.ComparisonStrategy.TWO_TLU_BIGGER_CLIPPED_SMALLER_PROMOTED

This strategy is similar to the strategy described above. The difference is that with this strategy, the smaller operand will be constrained to have at least the required bit-width to store clipped(bigger) - smaller or smaller - clipped(bigger). The bigger operand will still be clipped to that bit-width during runtime. The idea is:

clipper = fhe.LookupTable([...])
y_clipped = clipper[y]

comparison_lut = fhe.LookupTable([...])
result = comparison_lut[x_promoted_to_uint4 - y_clipped]
# or
another_comparison_lut = fhe.LookupTable([...])
result = another_comparison_lut[y_clipped - x_promoted_to_uint4]

Pros

It will only put a constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in exactly 2 table lookups, which is great.

Cons

It will increase the bit-width of the bigger operand, which can result in significant slowdowns if the bigger operand is used in other costly operations.

Example

import numpy as np
from concrete import fhe

configuration = fhe.Configuration(
    comparison_strategy_preference=fhe.ComparisonStrategy.TWO_TLU_BIGGER_CLIPPED_SMALLER_PROMOTED
)

def f(x, y):
    return x < y

inputset = [
    (np.random.randint(0, 2**3), np.random.randint(0, 2**6))
    for _ in range(100)
]

compiler = fhe.Compiler(f, {"x": "encrypted", "y": "encrypted"})
circuit = compiler.compile(inputset, configuration, show_mlir=True)

produces

module {
  
  // promotions          ............
  func.func @main(%arg0: !FHE.eint<4>, %arg1: !FHE.eint<6>) -> !FHE.eint<1> {
    
    // clipping the bigger operand
    %cst = arith.constant dense<[0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8]> : tensor<64xi64>
    %0 = "FHE.apply_lookup_table"(%arg1, %cst) : (!FHE.eint<6>, tensor<64xi64>) -> !FHE.esint<4>
    
    // subtraction
    %1 = "FHE.to_signed"(%arg0) : (!FHE.eint<4>) -> !FHE.esint<4>
    %2 = "FHE.sub_eint"(%1, %0) : (!FHE.esint<4>, !FHE.esint<4>) -> !FHE.esint<4>
        
    // computing the result
    %cst_0 = arith.constant dense<[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]> : tensor<16xi64>
    %3 = "FHE.apply_lookup_table"(%2, %cst_0) : (!FHE.esint<4>, tensor<16xi64>) -> !FHE.eint<1>
    
    return %3 : !FHE.eint<1>
    
  }
  
}

Summary

Strategy

Minimum # of TLUs

Maximum # of TLUs

Can increase the bit-width of the inputs

Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.