Pruning

Pruning is a method to reduce neural network complexity, usually applied in order to reduce the computation cost or memory size. Pruning is used in Concrete-ML to control the size of accumulators in neural networks, thus making them FHE-compatible. See here for an explanation of accumulator bit-width constraints.

Overview of pruning in Concrete-ML

Pruning is used in Concrete-ML for two types of neural networks:

  1. Built-in neural networks include a pruning mechanism that can be parameterized by the user. The pruning type is based on L1-norm. To comply with FHE constraints, Concrete-ML uses unstructured pruning, as the aim is not to eliminate neurons or convolutional filters completely, but to decrease their accumulator bit-width.

  2. Custom neural networks, to work well under FHE constraints, should include pruning. When implemented with PyTorch, you can use the framework's pruning mechanism (e.g., L1-Unstructured pruning) to good effect, as sketched below.
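
For custom networks, a minimal sketch of unstructured pruning with PyTorch's `torch.nn.utils.prune` module might look like the following. The model architecture and the 50% pruning amount are illustrative assumptions, not Concrete-ML defaults:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class SmallMLP(nn.Module):
    """Illustrative fully connected network (layer sizes chosen arbitrarily)."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


model = SmallMLP()

# Zero out the 50% of weights with the smallest L1 magnitude in each layer
# (L1-Unstructured pruning): individual connections are zeroed, not whole
# neurons or filters, which is what keeps accumulator bit-widths small.
for layer in (model.fc1, model.fc2):
    prune.l1_unstructured(layer, name="weight", amount=0.5)

# Once training or fine-tuning is done, make the pruning permanent so the
# zeroed weights are baked into the weight tensors before FHE compilation.
for layer in (model.fc1, model.fc2):
    prune.remove(layer, "weight")
```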

Basics of pruning

In neural networks, a neuron computes a linear combination of inputs and learned weights, then applies an activation function.

The neuron computes:

$$y_k = \phi\left(\sum_i w_i x_i\right)$$

When building a full neural network, each layer will contain multiple neurons, which are connected to the neuron outputs of a previous layer or to the inputs.
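
As a small illustration of the formula above, the following NumPy sketch computes one neuron's output on quantized integer inputs and weights; the vector size, value range, and ReLU activation are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantized, signed 2-bit values in [-2, 1] (illustrative bit-width).
x = rng.integers(-2, 2, size=100)  # inputs
w = rng.integers(-2, 2, size=100)  # learned weights

accumulator = int(np.dot(w, x))    # the multi-sum: sum_i w_i * x_i
y = max(accumulator, 0)            # activation function phi (ReLU here)
```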

Pruning entails fixing some of the weights to 0 during training. This helps meet FHE constraints: a zeroed weight contributes nothing to the multi-sum, so, regardless of the input distribution, it cannot increase the accumulator value. Fixing some of the weights to 0 makes the network graph look more similar to the following:

While pruning weights can reduce the prediction performance of the neural network, studies show that a high level of pruning (above 50%) can often be applied. See here how Concrete-ML uses pruning in Fully Connected Neural Networks.
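
The effect of pruning on the accumulator can be seen with a quick worst-case count. Assuming, purely for illustration, 2-bit signed weights and inputs and a neuron with 1000 connections:

```python
import math

# Illustrative assumptions: 2-bit signed weights and inputs, so each
# product |w_i * x_i| is at most 1 * 1 = 1.
max_product = (2 ** (2 - 1) - 1) * (2 ** (2 - 1) - 1)  # = 1

def worst_case_bits(n_active: int) -> int:
    """Bits needed to hold the worst-case multi-sum of n_active products."""
    return math.ceil(math.log2(n_active * max_product + 1))

print(worst_case_bits(1000))  # 10 bits with all 1000 connections active
print(worst_case_bits(100))   # 7 bits after pruning 90% of the weights
```

In this example, pruning 90% of the connections saves 3 bits of accumulator width in the worst case.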

Pruning in practice

In the formula above, in the worst case, the maximum number of inputs and weights that can be summed before the result exceeds $n$ bits is given by:

$$\Omega = \mathsf{floor} \left( \frac{2^{n} - 1}{(2^{n_{\mathsf{weights}} - 1} - 1)(2^{n_{\mathsf{inputs}} - 1} - 1)} \right)$$

where $n_{\mathsf{weights}}$ and $n_{\mathsf{inputs}}$ are the bit-widths of the quantized weights and inputs.
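
As a sketch, this formula can be evaluated directly in Python; the helper name and the bit-widths passed below are illustrative values:

```python
def max_active_connections(n: int, n_weights: int, n_inputs: int) -> int:
    """Worst-case number of weight/input products that fit in n accumulator bits."""
    max_weight = 2 ** (n_weights - 1) - 1  # largest signed weight magnitude
    max_input = 2 ** (n_inputs - 1) - 1    # largest signed input magnitude
    return (2 ** n - 1) // (max_weight * max_input)

# With 2-bit weights and inputs, each product is at most 1, so a 6-bit
# accumulator can hold at most 63 such products (illustrative values).
print(max_active_connections(n=6, n_weights=2, n_inputs=2))  # 63
```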
