This repository contains the federated, distributed implementation of the privacy-preserving tree-based model proposed in the paper Privacy-preserving tree-based machine learning for data-heterogeneous and low-data regimes. The paper will be submitted soon.
This repository builds upon the helium-artifacts repository, which contains the artifacts of the paper Helium: Scalable MPC among Lightweight Participants and under Churn by Mouchet et al. The paper appeared at CCS 2024 and is available at https://eprint.iacr.org/2024/194. We thank Christian Mouchet and his co-authors for providing such great reproducible code and allowing its reuse!
- The present artifact repository
  - imports the Helium repository at a pre-release version of v0.2.2
  - contains:
    - a Helium application implementing collaborative and privacy-preserving training of a completely random forest on tabular, horizontally federated data
    - scripts for building and running experiments with it
  - is hosted at https://github.com/phoeinx/PP-CRF
  - is mirrored at TODO
- The Helium repository
  - contains the code for the Helium system
  - is hosted at https://github.com/ChristianMct/helium
  - is mirrored at https://zenodo.org/doi/10.5281/zenodo.11045945
This section details the procedure for building and running the Completely Random Forest experiments.
The following software is required on the machine(s) running the experiments (see below for an automated way of setting up the machines). The version numbers are those used for the paper's results, but are only indicative.
The following Python packages are also required:
- docker, version 7.0.0
- paramiko, version 3.4.0
- pandas, version 2.0.0
- scikit-learn, version 1.3.0
Ansible can be used to set up all the above dependencies on SSH-accessible machines (the following instructions include the related command).
In this first part, we cover the steps to run a small-scale test experiment to demonstrate the process. We assume it is performed on a local machine on which the requirements are already set up. If you plan to work on a server directly, see the next part, which includes an automated setup over SSH.
- Clone the artifact repository:

  ```
  git clone https://github.com/phoeinx/PP-CRF && cd PP-CRF
  ```

- Build the experiment Docker image:

  ```
  make helium
  ```

- Run the experiment:

  ```
  python3 helium/exp_runner/main.py >> results
  ```
This last command runs the experiments for a grid of parameters and stores the results in ./results.
By default, the experiment and grid parameters represent a small set of experiments, for local test purposes.
Reproducing the results of the paper requires larger-scale experiments, which may require two servers.
For this part, we assume that the steps above have been performed on a local machine that has public-key SSH access to two servers
with host names <host1> and <host2>, and that <host1> has public-key SSH access to <host2>. In the steps below, <host1> will drive
the experiment and run the session nodes, while <host2> will run the helper "cloud".
- Set up the servers with Ansible:

  ```
  ansible-playbook -i <host1>,<host2> conf/ansible/setup_server.pb.yml
  ```

- SSH into <host1> and cd PP-CRF.
- Open the experiment runner script at helium/exp_runner/main.py.
- Change the Docker host name for the cloud:

  ```
  CLOUD_HOST = 'localhost'  =>  CLOUD_HOST = '<host2>'
  ```

- Run the experiment:

  ```
  python3 helium/exp_runner/main.py >> results
  ```
The current implementation only supports numerical datasets. Categorical features are not yet supported.
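Until categorical features are supported, a dataset can be made fully numerical before preprocessing by integer-encoding its categorical columns. A minimal sketch using scikit-learn's OrdinalEncoder (the column names and values below are illustrative, not taken from the repository's datasets):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative dataset with one categorical and one numerical column.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],  # categorical feature
    "size": [1.0, 2.5, 3.0, 0.5],              # numerical feature
})

# Map each category to an integer; categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2.
encoder = OrdinalEncoder()
df["color"] = encoder.fit_transform(df[["color"]]).ravel()

print(df["color"].tolist())  # [2.0, 1.0, 2.0, 0.0]
```

Note that ordinal encoding imposes an arbitrary order on the categories, which a tree-based model can tolerate but which may not match the model evaluated in the paper.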
The experiment grid is defined at the top of helium/exp_runner/main.py. The key parameters are:

```
# ====== Experiment Grid ======
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]  # number of federated parties
NUMBER_ESTIMATORS = [100]  # number of trees in the forest
TREE_DEPTH = [3]  # depth of each tree
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]  # probability a party skips a record
# ====== Cross-validation ======
N_FOLDS = 10  # number of stratified k-fold splits
FOLD_RANDOM_SEED = 0
# ====== Datasets ======
DATASETS = [
    "preprocessed_Breast Cancer Wisconsin (Original).csv",
    "preprocessed_MAGIC Gamma Telescope.csv",
    "preprocessed_TCGA.csv",
]
```

The SKIP_TO variable enables restarting from a specific experiment number in the grid if a run is interrupted.
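The number of runs grows as the product of the parameter lists. A sketch of how such a grid could be enumerated and how a SKIP_TO-style offset resumes an interrupted run; the enumeration order and the run_experiment call are illustrative, not the actual code in main.py:

```python
from itertools import product

# Illustrative subset of the grid (single-valued lists omitted; the
# actual runner may iterate additional parameters).
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
N_FOLDS = 10
DATASETS = ["bcw", "magic", "tcga"]  # placeholder names

grid = list(product(N_PARTIES, NON_PARTICIPATION_PROB, DATASETS, range(N_FOLDS)))
print(len(grid))  # 7 * 7 * 3 * 10 = 1470 experiments

SKIP_TO = 100  # resume from experiment number 100 after an interruption
for i, params in enumerate(grid):
    if i < SKIP_TO:
        continue
    # run_experiment(*params)  # placeholder for the actual runner call
```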
Each experiment produces a JSON line on stdout. The fields are:
- n_party: number of federated parties ($N$)
- threshold: cryptographic threshold ($T$), the number of parties required for decryption
- failure_rate: simulated node failure rate in failures/min
- failure_duration: mean failure duration in minutes
- rep: repetition index
- rate_limit: outgoing network rate limit applied to party containers
- delay: outgoing network delay applied to party containers
- TimeSetup: wall time in seconds for the MHE setup phase (key generation), measured at the cloud
- SentSetup: setup-phase outgoing traffic in MB at the cloud
- RecvSetup: setup-phase incoming traffic in MB at the cloud
- TimeCompute: wall time in seconds for the compute phase (all circuit evaluations), measured at the cloud
- SentCompute: compute-phase outgoing traffic in MB at the cloud
- RecvCompute: compute-phase incoming traffic in MB at the cloud
- theoretical_node_online: expected number of online nodes at equilibrium
- theoretical_time_above_thresh: expected fraction of time with at least $T$ nodes online
- actual_node_online: empirical average number of online nodes during the experiment
- actual_time_above_thresh: empirical fraction of time with at least $T$ nodes online
- dataset: name of the dataset used
- n_estimators: number of trees in the forest
- tree_depth: depth of each tree
- non_participation_prob: probability that a party skips contributing to a given record
- fold: cross-validation fold index
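Since each result is a single JSON object per line, the ./results file can be loaded directly into pandas for analysis. A minimal sketch; the two result lines below are illustrative, with only a few of the fields listed above:

```python
import io
import pandas as pd

# Two illustrative JSON-lines results (real lines carry all fields).
raw = io.StringIO(
    '{"n_party": 10, "threshold": 7, "TimeSetup": 12.3, "dataset": "tcga"}\n'
    '{"n_party": 50, "threshold": 34, "TimeSetup": 58.1, "dataset": "tcga"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(raw, lines=True)
print(df["n_party"].tolist())  # [10, 50]
```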
The experiment runner executes all experiments in the grid, outputting each result on a new line.
The configuration of the Helium nodes can be found in the genConfigForNode function of helium/app/main.go.
The current configuration has the following characteristics:
- FHE parameters: polynomial degree $2^{12}$, coefficient modulus of 109 bits (LogQ: [45, 45], LogP: [19]), plaintext modulus 79873.
- Protocol concurrency: up to 64 concurrent protocols per node (MaxProtoPerNode, MaxParticipation, MaxAggregation).
- Up to 16 concurrently running circuits (MaxCircuitEvaluation).
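The 109-bit figure is simply the sum of the prime sizes in LogQ and LogP. A quick sanity check of the parameter arithmetic (variable names mirror the config fields; this is an illustration, not code from the repository):

```python
LOG_N = 12                 # ring degree N = 2**12 = 4096
LOG_Q = [45, 45]           # ciphertext modulus prime sizes, in bits
LOG_P = [19]               # auxiliary (key-switching) prime size, in bits
PLAINTEXT_MODULUS = 79873

# Total coefficient modulus size is the sum of all prime sizes.
total_bits = sum(LOG_Q) + sum(LOG_P)
print(total_bits)          # 45 + 45 + 19 = 109 bits

assert 2 ** LOG_N == 4096
```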