This document explains how to train SGD Logistic Regression on encrypted data.

Training on encrypted data is done through an FHE program that Concrete ML generates based on the characteristics of the data given to the `fit` function. Once the FHE program associated with the `SGDClassifier` object has been fit on the encrypted data, it is specific to that data's distribution and dimensionality.

When deploying encrypted training services, you need to consider the type of data that future users of your services will train on:

- The distribution of the data should match that of the data the FHE program was generated from, to achieve good accuracy.
- The dimensionality of the data needs to match, since the deployed FHE programs are compiled for a fixed number of dimensions.

See the deployment section for more details.

Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, optionally coupled with *differential privacy*, instead of encryption. Concrete ML can import models trained through federated learning using third-party tools. All model types are supported (linear, tree-based and neural networks) through the `from_sklearn_model` function and the `compile_torch_model` function.

Example

The logistic regression training example shows logistic regression training on encrypted data in action.

The following snippet shows how to instantiate a logistic regression model that trains on encrypted data:
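
A minimal sketch of such an instantiation is given here. It assumes that `SGDClassifier` is importable from `concrete.ml.sklearn`; the argument values and the helper name `build_encrypted_sgd_classifier` are illustrative assumptions, not recommended settings.

```python
# Constructor arguments for a model that trains on encrypted data.
# All values below are illustrative assumptions, not tuned recommendations.
sgd_kwargs = {
    "random_state": 42,               # reproducible initialization
    "max_iter": 50,                   # number of batches to process
    "fit_encrypted": True,            # generate an FHE training program
    "parameters_range": (-1.0, 1.0),  # range used to initialize weights and bias
}


def build_encrypted_sgd_classifier():
    """Instantiate the classifier; requires Concrete ML to be installed."""
    # Imported lazily so the settings above can be inspected without it.
    from concrete.ml.sklearn import SGDClassifier

    return SGDClassifier(**sgd_kwargs)
```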

To activate encrypted training, simply set `fit_encrypted=True` in the constructor. When the value is set, Concrete ML generates an FHE program which, when called through the `fit` function, processes encrypted training data, labels and initial weights and outputs trained model weights. If this value is not set, training is performed on clear data using `scikit-learn` gradient descent.

Next, to perform the training on encrypted data, call the `fit` function with the `fhe="execute"` argument:
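
The call can be sketched as a small wrapper. The helper name `fit_on_encrypted_data` and the arguments `X_train` and `y_train` are hypothetical; `model` is assumed to be an `SGDClassifier` built with `fit_encrypted=True`.

```python
def fit_on_encrypted_data(model, X_train, y_train):
    """Train `model` through its FHE training program and return it."""
    # fhe="execute" asks fit to run the FHE program on encrypted inputs
    # instead of training on clear data.
    model.fit(X_train, y_train, fhe="execute")
    return model
```

Running the FHE program is much slower than clear training, so it is usually worth trying a small dataset first.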

Training configuration

The `max_iter` parameter controls the number of batches that are processed by the training algorithm.
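
To illustrate what "one iteration per batch" means, here is a plain-Python sketch of mini-batch SGD for logistic regression on clear data. It is not the FHE program that Concrete ML generates; the batch sampling, learning rate and function name are assumptions made for the example.

```python
import math
import random


def sgd_logistic_regression(X, y, max_iter, batch_size=2, learning_rate=0.5, seed=0):
    """Clear-data SGD for logistic regression; processes max_iter batches."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    batches_processed = 0

    for _ in range(max_iter):  # one iteration == one batch
        batch = rng.sample(range(len(X)), batch_size)
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for i in batch:
            z = sum(w * x for w, x in zip(weights, X[i])) + bias
            error = 1.0 / (1.0 + math.exp(-z)) - y[i]  # prediction - label
            for j in range(n_features):
                grad_w[j] += error * X[i][j] / batch_size
            grad_b += error / batch_size
        # Gradient step on the averaged batch gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        batches_processed += 1

    return weights, bias, batches_processed
```

With `max_iter=20`, the loop above performs exactly 20 weight updates, regardless of how many training examples are available.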

Capabilities and Limitations

The trainable logistic model uses Stochastic Gradient Descent (SGD) and quantizes the data, weights, gradients and the error measure. It currently supports training 6-bit models, including both the coefficients and the bias.
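
As an intuition for what 6-bit quantization implies, the sketch below maps a floating-point value to one of 64 integer levels and back. It assumes a simple uniform scheme over a known range; the quantization that Concrete ML actually applies may differ, and the function names are hypothetical.

```python
def quantize(value, low, high, n_bits=6):
    """Map a float in [low, high] to an integer in [0, 2**n_bits - 1]."""
    levels = 2**n_bits - 1
    clipped = min(max(value, low), high)  # saturate out-of-range values
    return round((clipped - low) / (high - low) * levels)


def dequantize(q, low, high, n_bits=6):
    """Map a quantized integer back to an approximate float value."""
    levels = 2**n_bits - 1
    return low + q * (high - low) / levels
```

With 6 bits over $[-1, 1]$, the step between adjacent levels is $2/63 \approx 0.032$, so a value can be off by up to half that step after a round trip.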

The `SGDClassifier` does not currently support training models with other bit-width values. The execution time to train a model is proportional to the number of features and the number of training examples in the batch. The `SGDClassifier` training does not currently support client/server deployment for training.

Deployment

Once you have tested an `SGDClassifier` that trains on encrypted data, you can build an FHE training service by deploying the FHE training program of the `SGDClassifier`. See the Production Deployment page for more details on how to use the Concrete ML deployment utility classes. To deploy an FHE training program, you must pass the `mode='training'` parameter to the `FHEModelDev` class.

The `parameters_range` parameter determines the initialization of the coefficients and the bias of the logistic regression. It is recommended to give values that are close to the min/max of the training data. It is also possible to normalize the training data so that it lies in the range $[-1, 1]$.
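
The normalization option can be sketched with per-feature min-max scaling in plain Python. The helper name and the sample data are hypothetical, and the scaling statistics are assumed to be computed on clear data that is representative of what the service will train on.

```python
def scale_to_unit_range(X):
    """Min-max scale each feature column of X to [-1, 1]."""
    n_features = len(X[0])
    mins = [min(row[j] for row in X) for j in range(n_features)]
    maxs = [max(row[j] for row in X) for j in range(n_features)]
    scaled = []
    for row in X:
        scaled_row = []
        for j, value in enumerate(row):
            span = maxs[j] - mins[j]
            # Constant columns carry no information; map them to 0.
            scaled_row.append(0.0 if span == 0 else 2.0 * (value - mins[j]) / span - 1.0)
        scaled.append(scaled_row)
    return scaled


# Hypothetical training data with two features on different scales.
X_train = [[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]]
X_scaled = scale_to_unit_range(X_train)

# After scaling, the data min/max is -1 and 1, so this range matches it.
parameters_range = (-1.0, 1.0)
```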