Bitwise operations are not native operations in Concrete, so they need to be implemented using existing native operations (i.e., additions, clear multiplications, negations, table lookups). Concrete offers two different implementations for performing bitwise operations.
Chunked
This is the most general implementation that can be used in any situation. The idea is:
# (example below is for bit-width of 8 and chunk size of 4)# extract chunks of lhs using table lookupslhs_chunks = [lhs.bits[0:4], lhs.bits[4:8]]# extract chunks of rhs using table lookupsrhs_chunks = [rhs.bits[0:4], rhs.bits[4:8]]# pack chunks of lhs and rhs using clear multiplications and additions packed_chunks = []for lhs_chunk, rhs_chunk inzip(lhs_chunks, rhs_chunks): shifted_lhs_chunk = lhs_chunk *2**4# (i.e., lhs_chunk << 4) packed_chunks.append(shifted_lhs_chunk + rhs_chunk)# apply comparison table lookup to packed chunksbitwise_table = fhe.LookupTable([...])result_chunks = bitwise_table[packed_chunks]# sum resulting chunks obtain the resultresult = np.sum(result_chunks)
Notes
Signed bitwise operations are not supported.
Optimal chunk size is selected automatically to reduce the number of table lookups.
Chunked bitwise operations result in at least 4 and at most 9 table lookups.
It is used if no other implementation can be used.
module { // no promotionsfunc.func @main(%arg0: !FHE.eint<4>,%arg1: !FHE.eint<4>) ->!FHE.eint<4> { // extracting the first chunk of x, adjusted for shifting%cst =arith.constant dense<[0,0,0,0,4,4,4,4,8,8,8,8,12,12,12,12]> : tensor<16xi64>%0="FHE.apply_lookup_table"(%arg0,%cst) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // extracting the first chunk of y%cst_0 =arith.constant dense<[0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3]> : tensor<16xi64>%1="FHE.apply_lookup_table"(%arg1,%cst_0) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // packing first chunks%2="FHE.add_eint"(%0,%1) : (!FHE.eint<4>,!FHE.eint<4>) ->!FHE.eint<4> // applying the bitwise operation to first chunks, adjusted for addition in the end%cst_1 =arith.constant dense<[0,0,0,0,0,4,0,4,0,0,8,8,0,4,8,12]> : tensor<16xi64>%3="FHE.apply_lookup_table"(%2,%cst_1) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // extracting the second chunk of x, adjusted for shifting%cst_2 =arith.constant dense<[0,4,8,12,0,4,8,12,0,4,8,12,0,4,8,12]> : tensor<16xi64>%4="FHE.apply_lookup_table"(%arg0,%cst_2) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // extracting the second chunk of y%cst_3 =arith.constant dense<[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]> : tensor<16xi64>%5="FHE.apply_lookup_table"(%arg1,%cst_3) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // packing second chunks%6="FHE.add_eint"(%4,%5) : (!FHE.eint<4>,!FHE.eint<4>) ->!FHE.eint<4> // applying the bitwise operation to second chunks%cst_4 =arith.constant dense<[0,0,0,0,0,1,0,1,0,0,2,2,0,1,2,3]> : tensor<16xi64>%7="FHE.apply_lookup_table"(%6,%cst_4) : (!FHE.eint<4>, tensor<16xi64>) ->!FHE.eint<4> // adding resulting chunks to obtain the result%8="FHE.add_eint"(%7,%3) : (!FHE.eint<4>,!FHE.eint<4>) ->!FHE.eint<4>return%8 : !FHE.eint<4> }}
Packing Trick
This implementation uses the fact that we can combine two values into a single value and apply a single table lookup to this combined value!
There are two major problems with this implementation though:
packing requires the same bit-width across operands.
packing requires the bit-width of at least x.bit_width + y.bit_width and that bit-width cannot exceed maximum TLU bit-width, which is 16 at the moment.
What this means is if we are comparing uint3 and uint6, we need to convert both of them to uint9 in some way to do the packing and proceed with the TLU in 9-bits. There are 4 ways to achieve this behavior.
This strategy makes sure that during bit-width assignment, both operands are assigned the same bit-width, and that bit-width contains at least the amount of bits required to store pack(x, y). The idea is:
It will significantly increase the bit-width of both operands and lock them to each other across the whole circuit, which can result in significant slowdowns if the operands are used in other costly operations.
This strategy will not put any constraint in bit-widths during bit-width assignment, instead operands are cast to a bit-width that can store pack(x, y) during runtime using table lookups. The idea is:
It can result in a single table lookup as well, if x and y are assigned (because of other operations) the same bit-width, and that bit-width can store pack(x, y).
Or in two table lookups if only one of the operands is assigned a bit-width bigger than or equal to the bit width that can store pack(x, y).
Pros
It will not put any constraints on bit-widths of the operands, which is amazing if they are used in other costly operations.
It will result in at most 3 table lookups, which is still good.
Cons
If you are not doing anything else with the operands, or doing less costly operations compared to bitwise, it will introduce up to two unnecessary table lookups and slow down execution compared to fhe.BitwiseStrategy.ONE_TLU_PROMOTED.
This strategy is like the middle ground between the two strategies described above. With this strategy, only the bigger operand will be constrained to have at least the required bit-width to store pack(x, y), and the smaller operand will be cast to that bit-width during runtime. The idea is:
It can result in a single table lookup as well, if the smaller operand is assigned (because of other operations) the same bit-width as the bigger operand.
Pros
It will only put constraint on the bigger operand, which is great if the smaller operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
Cons
It will significantly increase the bit-width of the bigger operand which can result in significant slowdowns if the bigger operand is used in other costly operations.
If you are not doing anything else with the smaller operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED.
This strategy is like the exact opposite of the strategy above. With this, only the smaller operand will be constrained to have at least the required bit-width, and the bigger operand will be cast during runtime. The idea is:
It can result in a single table lookup as well, if the bigger operand is assigned (because of other operations) the same bit-width as the smaller operand.
Pros
It will only put constraint on the smaller operand, which is great if the bigger operand is used in other costly operations.
It will result in at most 2 table lookups, which is great.
Cons
It will increase the bit-width of the smaller operand which can result in significant slowdowns if the smaller operand is used in other costly operations.
If you are not doing anything else with the bigger operand, or doing less costly operations compared to comparison, it could introduce an unnecessary table lookup and slow down execution compared to fhe.BitwiseStrategy.THREE_TLU_CASTED.
Concrete will choose the best strategy available after bit-width assignment, regardless of the specified preference.
Different strategies are good for different circuits. If you want the best runtime for your use case, you can compile your circuit with all different comparison strategy preferences, and pick the one with the lowest complexity.
Shifts
The same configuration option is used to modify the behavior of encrypted shift operations, and shifts are much more complex to implement, so we'll not go over the details. What is important is, the end the result is computed using additions or subtractions on the original shifted operand. Since additions and subtractions require the same bit-width across operands, input and output bit-widths need to be synchronized at some point. There are two ways to do this:
With promotion
In this way, shifted operand and shift result is assigned the same bit-width during bit-width assignment, which avoids an additional TLU on the shifted operand, but it might increase the bit-width of the result or the shifted operand, and if they're used in other costly operations, it could result in significant slowdowns. This is the default behavior.
Approach described above could be suboptimal for some circuits, so it is advised to check the complexity with it disabled before production. Here is how the implementation changes with it disabled.