Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
Loading...
This document details the concept of rounding, and how it is used in Concrete to make some FHE computations especially faster.
Table lookups have a strict constraint on the number of bits they support. This can be limiting, especially if you don't need exact precision. As well as this, using larger bit-widths leads to slower table lookups.
To overcome these issues, rounded table lookups are introduced. This operation provides a way to round the least significant bits of a large integer and then apply the table lookup on the resulting (smaller) value.
Imagine you have a 5-bit value, but you want to have a 3-bit table lookup. You can call fhe.round_bit_pattern(input, lsbs_to_remove=2)
and use the 3-bit value you receive as input to the table lookup.
Let's see how rounding works in practice:
prints:
and displays:
If the rounded number is one of the last 2**(lsbs_to_remove - 1)
numbers in the input range [0, 2**original_bit_width)
, an overflow will happen.
By default, if an overflow is encountered during inputset evaluation, bit-widths will be adjusted accordingly. This results in a loss of speed, but ensures accuracy.
You can turn this overflow protection off (e.g., for performance) by using fhe.round_bit_pattern(..., overflow_protection=False)
. However, this could lead to unexpected behavior at runtime.
Now, let's see how rounding can be used in FHE.
prints:
These speed-ups can vary from system to system.
The reason why the speed-up is not increasing with lsbs_to_remove
is because the rounding operation itself has a cost: each bit removal is a PBS. Therefore, if a lot of bits are removed, rounding itself could take longer than the bigger TLU which is evaluated afterwards.
and displays:
Feel free to disable overflow protection and see what happens.
Rounding is very useful but, in some cases, you don't know how many bits your input contains, so it's not reliable to specify lsbs_to_remove
manually. For this reason, the AutoRounder
class is introduced.
AutoRounder
allows you to set how many of the most significant bits to keep, but they need to be adjusted using an inputset to determine how many of the least significant bits to remove. This can be done manually using fhe.AutoRounder.adjust(function, inputset)
, or by setting auto_adjust_rounders
configuration to True
during compilation.
Here is how auto rounders can be used in FHE:
prints:
and displays:
AutoRounder
s should be defined outside the function that is being compiled. They are used to store the result of the adjustment process, so they shouldn't be created each time the function is called. Furthermore, each AutoRounder
should be used with exactly one round_bit_pattern
call.
One use of rounding is doing faster computation by ignoring the lower significant bits. For this usage, you can even get faster results if you accept the rounding it-self to be slightly inexact. The speedup is usually around 2x-3x but can be higher for big precision reduction. This also enable higher precisions values that are not possible otherwise.
You can turn on this mode either globally on the configuration:
or on/off locally:
In approximate mode the rounding threshold up or down is not perfectly centered: The off-centering is:
is bounded, i.e. at worst an off-by-one on the reduced precision value compared to the exact result,
is pseudo-random, i.e. it will be different on each call,
almost symmetrically distributed,
depends on cryptographic properties like the encryption mask, the encryption noise and the crypto-parameters.
With approximate rounding, you can enable an approximate clipping to get further improve performance in the case of overflow handling. Approximate clipping enable to discard the extra bit of overflow protection bit in the successor TLU. For consistency a logical clipping is available when this optimization is not suitable.
When fast approximate clipping is not suitable (i.e. slower), it's better to apply logical clipping for consistency and better resilience to code change. It has no extra cost since it's fuzed with the successor TLU.
This set the first precision where approximate clipping is enabled, starting from this precision, an extra small precision TLU is introduced to safely remove the extra precision bit used to contain overflow. This way the successor TLU is faster. E.g. for a rounding to 7bits, that finishes to a TLU of 8bits due to overflow, forcing to use a TLU of 7bits is 3x faster.
This document explains how to compute minimum and maximum between values in Concrete, covering different strategies to make computations faster, depending on the strategy.
Finding the minimum or maximum of two numbers is not a native operation in Concrete, so it needs to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers two different implementations for this.
This is the most general implementation that can be used in any situation. The idea is:
Multiplication with operands aren't allowed to increase the bit-width of the inputs, so they are very expensive as well.
Optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked comparisons result in at least 9 and at most 21 table lookups.
It is used if no other implementation can be used.
Can be used with any integers.
Extremely expensive.
produces
This implementation uses the fact that [min,max](x, y)
is equal to [min, max](x - y, 0) + y
, which is just a subtraction, a table lookup and an addition!
There are two major problems with this implementation though:
subtraction before the TLU requires up to 2 additional bits to avoid overflows (it is 1 in most cases).
subtraction and addition require the same bit-width across operands.
What this means is that if we are comparing uint3
and uint6
, we need to convert both of them to uint7
in some way to do the subtraction and proceed with the TLU in 7-bits. There are 2 ways to achieve this behavior.
This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the amount of bits required to store x - y
. The idea is:
It will always result in a single table lookup.
It will increase the bit-width of both operands and the result, and lock them together across the whole circuit, which can result in significant slowdowns if the result or the operands are used in other costly operations.
produces
This strategy will not put any constraint on bit-widths during bit-width assignment. Instead, operands are cast to a bit-width that can store x - y
during runtime using table lookups. The idea is:
It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store x - y
.
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store x - y
.
It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
If you are not doing anything else with the operands, or doing less costly operations compared to comparison, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.MinMaxStrategy.ONE_TLU_PROMOTED
.
produces
Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.
Different strategies are good for different circuits. If you want the best runtime for your use case, you can compile your circuit with all different comparison strategy preferences, and pick the one with the lowest complexity.
Initial comparison is as well, which is already very expensive.
Strategy | Minimum # of TLUs | Maximum # of TLUs | Can increase the bit-width of the inputs |
---|
CHUNKED | 9 | 21 |
ONE_TLU_PROMOTED | 1 | 1 | ✓ |
THREE_TLU_CASTED | 1 | 3 |
This document describes how floating points are treated and manipulated in Concrete.
Concrete partly supports floating points. There is no support for floating point inputs or outputs. However, there is support for intermediate values to be floating points (under certain constraints). Also, we note that one can use an equivalent of fixed points in Concrete, as described in our tutorial.
Concrete-Compile, which is used for compiling the circuit, doesn't support floating points at all. However, it supports table lookups which take an integer and map it to another integer. The constraints of this operation are that there should be a single integer input, and a single integer output.
As long as your floating point operations comply with those constraints, Concrete automatically converts them to a table lookup operation:
In the example above, a
, b
, and c
are floating point intermediates. They are used to calculate d
, which is an integer with a value dependent upon x
, which is also an integer. Concrete detects this and fuses all of these operations into a single table lookup from x
to d
.
This approach works for a variety of use cases, but it comes up short for others:
This results in:
The reason for the error is that d
no longer depends solely on x
; it depends on y
as well. Concrete cannot fuse these operations, so it raises an exception instead.
This document explains the concept of direct circuits in Concrete, which is another way to compile circuit without having to give a proper inputset.
Direct circuits are still experimental. It is very easy to make mistakes (e.g., due to no overflow checks or type coercion) while using direct circuits, so utilize them with care.
For some applications, the data types of inputs, intermediate values, and outputs are known (e.g., for manipulating bytes, you would want to use uint8). Using inputsets to determine bounds in these cases is not necessary, and can even be error-prone. Therefore, another interface for defining such circuits is introduced:
There are a few differences between direct circuits and traditional circuits:
Remember that the resulting dtype for each operation will be determined by its inputs. This can lead to some unexpected results if you're not careful (e.g., if you do -x
where x: fhe.uint8
, you won't receive a negative value as the result will be fhe.uint8
as well)
There is no inputset evaluation when using fhe types in .astype(...)
calls (e.g., np.sqrt(x).astype(fhe.uint4)
), so the bit width of the output cannot be determined.
Specify the resulting data type in univariate extension (e.g., fhe.univariate(function, outputs=fhe.uint4)(x)
), for the same reason as above.
Be careful with overflows. With inputset evaluation, you'll get bigger bit widths but no overflows. With direct definition, you must ensure that there aren't any overflows manually.
Let's review a more complicated example to see how direct circuits behave:
This prints:
Here is the breakdown of the assigned data types:
As you can see, %8
is subtraction of two unsigned values, and the result is unsigned as well. In the case that c > d
, we have an overflow, and this results in undefined behavior.
*Using the default configuration in approximate mode. For 3, 4, 5 and 6 reduced precision bits and accumulator precision up to 32bits
In blue the exact value, the red dots are approximate values due to off-centered transition in approximate mode.
Histogram of transitions off-centering delta. Each count correspond to a specific random mask and a specific encryption noise.
Only the last step is clipped.
The last steps are decreased.
This document details the management of Table Lookups(TLU) within Concrete for advanced usage. For a simpler guide, refer to the Table Lookup Basics.
One of the most common operations in Concrete are Table Lookups
(TLUs). All operations except addition, subtraction, multiplication with non-encrypted values, tensor manipulation operations, and a few operations built with those primitive operations (e.g. matmul, conv) are converted to Table Lookups under the hood.
Table Lookups are very flexible. They allow Concrete to support many operations, but they are expensive. The exact cost depends on many variables (hardware used, error probability, etc.), but they are always much more expensive compared to other operations. You should try to avoid them as much as possible. It's not always possible to avoid them completely, but you might remove the number of TLUs or replace some of them with other primitive operations.
Concrete automatically parallelizes TLUs if they are applied to tensors.
Concrete provides a LookupTable
class to create your own tables and apply them in your circuits.
LookupTable
s can have any number of elements. Let's call the number of elements N. As long as the lookup variable is within the range [-N, N), the Table Lookup is valid.
If you go outside of this range, you will receive the following error:
You can create the lookup table using a list of integers and apply it using indexing:
When you apply a table lookup to a tensor, the scalar table lookup is applied to each element of the tensor:
LookupTable
mimics array indexing in Python, which means if the lookup variable is negative, the table is looked up from the back:
If you want to apply a different lookup table to each element of a tensor, you can have a LookupTable
of LookupTable
s:
In this example, we applied a squared
table to the first column and a cubed
table to the second column.
Concrete tries to fuse some operations into table lookups automatically so that lookup tables don't need to be created manually:
All lookup tables need to be from integers to integers. So, without .astype(np.int64)
, Concrete will not be able to fuse.
The function is first traced into:
Concrete then fuses appropriate nodes:
Fusing makes the code more readable and easier to modify, so try to utilize it over manual LookupTable
s as much as possible.
We refer the users to this page for explanations about fhe.univariate(function)
and fhe.multivariate(function)
features, which are convenient ways to use automatically created table lookup.
TLUs are performed with an FHE operation called Programmable Bootstrapping
(PBS). PBSs have a certain probability of error: when these errors happen, it results in inaccurate results.
Let's say you have the table:
And you perform a Table Lookup using 4
. The result you should get is lut[4] = 16
, but because of the possibility of error, you could get any other value in the table.
The probability of this error can be configured through the p_error
and global_p_error
configuration options. The difference between these two options is that, p_error
is for individual TLUs but global_p_error
is for the whole circuit.
If you set p_error
to 0.01
, for example, it means every TLU in the circuit will have a 99% chance (or more) of being exact. If there is a single TLU in the circuit, it corresponds to global_p_error = 0.01
as well. But if we have 2 TLUs, then global_p_error
would be higher: that's 1 - (0.99 * 0.99) ~= 0.02 = 2%
.
If you set global_p_error
to 0.01
, the whole circuit will have at most 1% probability of error, no matter how many Table Lookups are included (which means that p_error
will be smaller than 0.01
if there are more than a single TLU).
If you set both of them, both will be satisfied. Essentially, the stricter one will be used.
By default, both p_error
and global_p_error
are set to None
, which results in a global_p_error
of 1 / 100_000
being used.
Feel free to play with these configuration options to pick the one best suited for your needs! See How to Configure to learn how you can set a custom p_error
and/or global_p_error
.
Configuring either of those variables impacts compilation and execution times (compilation, keys generation, circuit execution) and space requirements (size of the keys on disk and in memory). Lower error probabilities result in longer compilation and execution times and larger space requirements.
PBSs are very expensive, in terms of computations. Fortunately, it is sometimes possible to replace PBS by rounded PBS, truncate PBS or even approximate PBS. These TLUs have a slightly different semantic, but are very useful in cases like machine learning for more efficiency without drop of accuracy.
This document details the concept of truncating, and how it is used in Concrete to make some FHE computations especially faster.
Table lookups have a strict constraint on the number of bits they support. This can be limiting, especially if you don't need exact precision. As well as this, using larger bit-widths leads to slower table lookups.
To overcome these issues, truncated table lookups are introduced. This operation provides a way to zero the least significant bits of a large integer and then apply the table lookup on the resulting (smaller) value.
Imagine you have a 5-bit value, you can use fhe.truncate_bit_pattern(value, lsbs_to_remove=2)
to truncate it (here the last 2 bits are discarded). Once truncated, value will remain in 5-bits (e.g., 22 = 0b10110 would be truncated to 20 = 0b10100), and the last 2 bits of it would be zero. Concrete uses this to optimize table lookups on the truncated value, the 5-bit table lookup gets optimized to a 3-bit table lookup, which is much faster!
Let's see how truncation works in practice:
prints:
and displays:
Now, let's see how truncating can be used in FHE.
prints:
These speed-ups can vary from system to system.
The reason why the speed-up is not increasing with lsbs_to_remove
is because the truncating operation itself has a cost: each bit removal is a PBS. Therefore, if a lot of bits are removed, truncation itself could take longer than the bigger TLU which is evaluated afterwards.
and displays:
Truncating is very useful but, in some cases, you don't know how many bits your input contains, so it's not reliable to specify lsbs_to_remove
manually. For this reason, the AutoTruncator
class is introduced.
AutoTruncator
allows you to set how many of the most significant bits to keep, but they need to be adjusted using an inputset to determine how many of the least significant bits to remove. This can be done manually using fhe.AutoTruncator.adjust(function, inputset)
, or by setting auto_adjust_truncators
configuration to True
during compilation.
Here is how auto truncators can be used in FHE:
prints:
and displays:
AutoTruncator
s should be defined outside the function that is being compiled. They are used to store the result of the adjustment process, so they shouldn't be created each time the function is called. Furthermore, each AutoTruncator
should be used with exactly one truncate_bit_pattern
call.
This document describes how comparisons are managed in Concrete, typically "AND", "OR", and so on. It covers different strategies to make the FHE computations faster, depending on the context.
Bitwise operations are not native operations in Concrete, so they need to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers two different implementations for performing bitwise operations.
This is the most general implementation that can be used in any situation. The idea is:
Signed bitwise operations are not supported.
The optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked bitwise operations result in at least 4 and at most 9 table lookups.
It is used if no other implementation can be used.
Can be used with any integers.
Very expensive.
produces
This implementation uses the fact that we can combine two values into a single value and apply a single table lookup to this combined value!
There are two major problems with this implementation:
packing requires the same bit-width across operands.
packing requires the bit-width of at least x.bit_width + y.bit_width
and that bit-width cannot exceed maximum TLU bit-width, which is 16
at the moment.
What this means is if we are comparing uint3
and uint6
, we need to convert both of them to uint9
in some way to do the packing and proceed with the TLU in 9-bits. There are 4 ways to achieve this behavior.
This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the amount of bits required to store pack(x, y)
. The idea is:
It will always result in a single table lookup.
It will significantly increase the bit-width of both operands and lock them to each other across the whole circuit, which can result in significant slowdowns if the operands are used in other costly operations.
produces
This strategy will not put any constraint on bit-widths during bit-width assignment, instead operands are cast to a bit-width that can store pack(x, y)
during runtime using table lookups. The idea is:
It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store pack(x, y)
.
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store pack(x, y)
.
It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
If you are not doing anything else with the operands, or doing less costly operations compared to bitwise, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.BitwiseStrategy.ONE_TLU_PROMOTED
.
produces
This strategy can be viewed as a middle ground between the two strategies described above. With this strategy, only the bigger operand will be constrained to have at least the required bit-width to store pack(x, y)
, and the smaller operand will be cast to that bit-width during runtime. The idea is:
It can result in a single table lookup as well, if the smaller operand is assigned (because of other operations) the same bit-width as the bigger operand.
It will only put a constraint on the bigger operand, which is great if the smaller operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
It will significantly increase the bit-width of the bigger operand which can result in significant slowdowns if the bigger operand is used in other costly operations.
If you are not doing anything else with the smaller operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED
.
produces
This strategy is like the exact opposite of the strategy above. With this, only the smaller operand will be constrained to have at least the required bit-width, and the bigger operand will be cast during runtime. The idea is:
It can result in a single table lookup as well, if the bigger operand is assigned (because of other operations) the same bit-width as the smaller operand.
It will only put constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
It will increase the bit-width of the smaller operand which can result in significant slowdowns if the smaller operand is used in other costly operations.
If you are not doing anything else with the bigger operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED
.
produces
Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.
Different strategies are good for different circuits. If you want the best runtime for your use case, you can compile your circuit with all different comparison strategy preferences, and pick the one with the lowest complexity.
The same configuration option is used to modify the behavior of encrypted shift operations, and shifts are much more complex to implement, so we'll not go over the details. What is important is, the end the result is computed using additions or subtractions on the original shifted operand. Since additions and subtractions require the same bit-width across operands, input and output bit-widths need to be synchronized at some point. There are two ways to do this:
Here, the shifted operand and shift result are assigned the same bit-width during bit-width assignment, which avoids an additional TLU on the shifted operand. On the other hand, it might increase the bit-width of the result or the shifted operand, and if they're used in other costly operations, it could result in significant slowdowns. This is the default behavior.
produces
The approach described above could be suboptimal for some circuits, so it is advised to check the complexity with it disabled before production. Here is how the implementation changes with it disabled.
produces
This document describes how comparisons are managed in Concrete, typically 'equal', 'greater than', and so on. It covers different strategies to make the FHE computations faster, depending on the context.
Comparisons are not native operations in Concrete, so they need to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers three different implementations for performing comparisons.
This is the most general implementation that can be used in any situation. The idea is:
Signed comparisons are more complex to explain, but they are supported!
The optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked comparisons result in at least 5 and at most 13 table lookups.
It is used if no other implementation can be used.
==
and !=
are using a different chunk comparison and reduction strategy with less table lookups.
Can be used with any integers.
Very expensive.
produces
This implementation uses the fact that x [<,<=,==,!=,>=,>] y
is equal to x - y [<,<=,==,!=,>=,>] 0
, which is just a subtraction and a table lookup!
There are two major problems with this implementation:
subtraction before the TLU requires up to 2 additional bits to avoid overflows (it is 1 in most cases).
subtraction requires the same bit-width across operands.
What this means is if we are comparing uint3
and uint6
, we need to convert both of them to uint7
in some way to do the subtraction and proceed with the TLU in 7-bits. There are 4 ways to achieve this behavior.
This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the number of bits required to store x - y
. The idea is:
It will always result in a single table lookup.
It will increase the bit-width of both operands and lock them to each other across the whole circuit, which can result in significant slowdowns if the operands are used in other costly operations.
produces
This strategy will not put any constraint on bit-widths during bit-width assignment, instead operands are cast to a bit-width that can store x - y
during runtime using table lookups. The idea is:
It can result in a single table lookup, if x and y are assigned (because of other operations) the same bit-width and that bit-width can store x - y
.
Alternatively, two table lookups can be used if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store x - y
.
It will not put any constraints on the bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
If you are not doing anything else with the operands, or doing less costly operations compared to comparison, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.ComparisonStrategy.ONE_TLU_PROMOTED
.
produces
This strategy can be seen as a middle ground between the two strategies described above. With this strategy, only the bigger operand will be constrained to have at least the required bit-width to store x - y
, and the smaller operand will be cast to that bit-width during runtime. The idea is:
It can result in a single table lookup, if the smaller operand is assigned (because of other operations) the same bit-width as the bigger operand.
It will only put a constraint on the bigger operand, which is great if the smaller operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
It will increase the bit-width of the bigger operand, which can result in significant slowdowns if the bigger operand is used in other costly operations.
If you are not doing anything else with the smaller operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.ComparisonStrategy.THREE_TLU_CASTED
.
produces
This strategy can be seen as the exact opposite of the strategy above. With this, only the smaller operand will be constrained to have at least the required bit-width, and the bigger operand will be cast during runtime. The idea is:
It can result in a single table lookup, if the bigger operand is assigned (because of other operations) the same bit-width as the smaller operand.
It will only put a constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
It will increase the bit-width of the smaller operand, which can result in significant slowdowns if the smaller operand is used in other costly operations.
If you are not doing anything else with the bigger operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.ComparisonStrategy.THREE_TLU_CASTED
.
produces
This implementation uses the fact that the subtraction trick is not optimal in terms of the required intermediate bit width. The comparison result does not change if we compare(3, 40)
or compare(3, 4)
, so why not clipping the bigger operand and then doing the subtraction to use less bits!
There are two major problems with this implementation:
it can not be used when the bit-widths are the same (for some cases even when they differ by only one bit)
subtraction still requires the same bit-width across operands.
What this means is if we are comparing uint3
and uint6
, we need to convert both of them to uint4
in some way to do the subtraction and proceed with the TLU in 7-bits. There are 2 ways to achieve this behavior.
This strategy will not put any constraint on bit-widths during bit-width assignment, instead the smaller operand is cast to a bit-width that can store clipped(bigger) - smaller
or smaller - clipped(bigger)
during runtime using table lookups. The idea is:
This is a fallback implementation, so if there is a difference of 1-bit (or in some cases 2-bits) and the subtraction trick cannot be used optimally, this implementation will be used instead of fhe.ComparisonStrategy.CHUNKED
.
It can result in two table lookups if the smaller operand is assigned a bit-width bigger than or equal to the bit width that can store clipped(bigger) - smaller
or smaller - clipped(bigger)
.
It will not put any constraints on the bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
These table lookups will be on smaller bit-widths, which is great.
Cannot be used to compare integers with the same bit-width, which is very common.
produces
This strategy is similar to the strategy described above. The difference is that with this strategy, the smaller operand will be constrained to have at least the required bit-width to store clipped(bigger) - smaller
or smaller - clipped(bigger)
. The bigger operand will still be clipped to that bit-width during runtime. The idea is:
It will only put a constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in exactly 2 table lookups, which is great.
It will increase the bit-width of the bigger operand, which can result in significant slowdowns if the bigger operand is used in other costly operations.
produces
Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.
Different strategies are good for different circuits. If you want the best runtime for your use case, you can compile your circuit with all different comparison strategy preferences, and pick the one with the lowest complexity.
This document explains the concept of tagging, which is a debugging tool to make a link between the user's Python code and the Concrete MLIR circuits. Such a link can be useful when an issue is raised by the compiler on some MLIR, to know which Python code it corresponds to.
When you have big circuits, keeping track of which node corresponds to which part of your code becomes difficult. A tagging system can simplify such situations:
When you compile f
with inputset of range(10)
, you get the following graph:
If you get an error, you'll see exactly where the error occurred (e.g., which layer of the neural network, if you tag layers).
In the future, we plan to use tags for additional features (e.g., to measure performance of tagged regions), so it's a good idea to start utilizing them for big circuits.
Strategy | Minimum # of TLUs | Maximum # of TLUs | Can increase the bit-width of the inputs |
---|---|---|---|
Strategy | Minimum # of TLUs | Maximum # of TLUs | Can increase the bit-width of the inputs |
---|---|---|---|
CHUNKED
4
9
ONE_TLU_PROMOTED
1
1
✓
THREE_TLU_CASTED
1
3
TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED
1
2
✓
TWO_TLU_BIGGER_CASTED_SMALLER_PROMOTED
1
2
✓
CHUNKED
5
13
ONE_TLU_PROMOTED
1
1
✓
THREE_TLU_CASTED
1
3
TWO_TLU_BIGGER_PROMOTED_SMALLER_CASTED
1
2
✓
TWO_TLU_BIGGER_CASTED_SMALLER_PROMOTED
1
2
✓
THREE_TLU_BIGGER_CLIPPED_SMALLER_CASTED
2
3
TWO_TLU_BIGGER_CLIPPED_SMALLER_PROMOTED
2
2
✓