This document explains how to train SGD Logistic Regression on encrypted data.
Training on encrypted data is done through an FHE program that is generated by Concrete ML, based on the characteristics of the data that are given to the fit function. Once the FHE program associated with the SGDClassifier object has fit the encrypted data, it is specific to that data's distribution and dimensionality.
When deploying encrypted training services, you need to consider the type of data that future users of your services will train on:
- The distribution of the data that users will train on should match the distribution of the data used to generate the FHE training program, in order to achieve good accuracy.
- The dimensionality of the data needs to match, since the deployed FHE programs are compiled for a fixed number of dimensions.
See the deployment section for more details.
Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional differential privacy, instead of encryption. Concrete ML can import models trained through federated learning using third-party tools. All model types are supported: linear, tree-based, and neural networks, through the from_sklearn_model function and the compile_torch_model function.
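For example, a model produced by a federated training framework and exported as a scikit-learn estimator could be imported as sketched below. The calibration data, the n_bits value, and the use of scikit-learn's LogisticRegression here are illustrative assumptions; only the from_sklearn_model entry point is taken from this document.

```python
# Rough sketch: importing a logistic regression trained outside Concrete ML
# (for example, through a federated learning framework).
# The calibration data and n_bits value are illustrative assumptions.
import numpy
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression
from concrete.ml.sklearn import LogisticRegression as ConcreteLogisticRegression

# Stand-in for a model trained through federated learning
X_calib = numpy.random.uniform(-1, 1, size=(100, 10))
y_calib = (X_calib.sum(axis=1) > 0).astype(int)
sklearn_model = SklearnLogisticRegression().fit(X_calib, y_calib)

# Import the trained weights, quantize them, and compile for FHE inference
concrete_model = ConcreteLogisticRegression.from_sklearn_model(sklearn_model, X_calib, n_bits=8)
concrete_model.compile(X_calib)
```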
The logistic regression training example shows training on encrypted data in action.
The following snippet shows how to instantiate a logistic regression model that trains on encrypted data:
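The snippet is a minimal sketch: fit_encrypted, max_iter and parameters_range are the parameters described in this document, while the random_state value is just an illustrative choice.

```python
# Minimal sketch of an SGDClassifier configured for encrypted training.
from concrete.ml.sklearn import SGDClassifier

model = SGDClassifier(
    random_state=42,               # illustrative seed for reproducibility
    max_iter=50,                   # number of encrypted training batches to process
    fit_encrypted=True,            # generate the FHE training program
    parameters_range=(-1.0, 1.0),  # assumed range for weight/bias initialization
)
```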
To activate encrypted training, simply set fit_encrypted=True in the constructor. When this parameter is set, Concrete ML generates an FHE program which, when called through the fit function, processes encrypted training data, labels and initial weights, and outputs trained model weights. If this parameter is not set, training is performed on clear data using scikit-learn gradient descent.
Next, to perform the training on encrypted data, call the fit function with the fhe="execute" argument:
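The call below is a minimal sketch that continues from the model instantiated above; X_train and y_train are illustrative placeholders for the training features and binary labels.

```python
# Minimal sketch: train the model on encrypted data.
# X_train and y_train are illustrative placeholders for the training set.
import numpy

X_train = numpy.random.uniform(-1, 1, size=(64, 10))
y_train = (X_train.sum(axis=1) > 0).astype(int)

# With fhe="execute", the data is encrypted and training runs in FHE
model.fit(X_train, y_train, fhe="execute")
```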
The max_iter parameter controls the number of batches that are processed by the training algorithm.
The trainable logistic model uses Stochastic Gradient Descent (SGD) and quantizes the data, weights, gradients and the error measure. It currently supports training 6-bit models, including both the coefficients and the bias.
The SGDClassifier does not currently support training models with other bit-width values. The execution time to train a model is proportional to the number of features and the number of training examples in the batch. The SGDClassifier does not currently support client/server deployment for training.
Once you have tested an SGDClassifier that trains on encrypted data, you can build an FHE training service by deploying the FHE training program of the SGDClassifier. See the Production Deployment page for more details on how to use the Concrete ML deployment utility classes. To deploy an FHE training program, you must pass the mode='training' parameter to the FHEModelDev class.
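As a rough sketch, and assuming the mode argument is passed when saving the deployment artifacts (the exact placement of mode='training' may differ between Concrete ML versions), this could look as follows:

```python
# Rough sketch: packaging the FHE training program for deployment.
# The output directory is an illustrative choice, and the exact placement of
# the mode="training" argument may differ between Concrete ML versions.
from concrete.ml.deployment import FHEModelDev

dev = FHEModelDev(path_dir="./fhe_training_artifacts", model=model)
dev.save(mode="training")
```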
The parameters_range parameter determines the initialization of the coefficients and the bias of the logistic regression. It is recommended to give values that are close to the min/max of the training data. It is also possible to normalize the training data so that it lies in a known range, for example [-1, 1].
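A minimal sketch of such a normalization, using scikit-learn's MinMaxScaler, is shown below; the [-1, 1] target range is an illustrative choice matching the parameters_range used earlier in this document.

```python
# Minimal sketch: scale the training data into a known range so that it
# matches the parameters_range given to the SGDClassifier.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))  # illustrative target range
X_train_scaled = scaler.fit_transform(X_train)
```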