This repository contains the federated, distributed implementation of the privacy-preserving tree-based model proposed in the paper Privacy-preserving tree-based machine learning for data-heterogeneous and low-data regimes. The paper will be submitted soon.
This repository builds upon the helium-artifacts repository, which contains the artifacts of the paper Helium: Scalable MPC among Lightweight Participants and under Churn by Mouchet et al. The paper appeared at CCS 2024 and is available at https://eprint.iacr.org/2024/194. We thank Christian Mouchet and his co-authors for providing such great reproducible code and allowing its reuse!
- The present artifact repository
  - imports the Helium repository at a pre-release version of v0.2.2
  - contains:
    - a Helium application implementing collaborative and privacy-preserving training of a completely random forest on tabular, horizontally federated data
    - scripts for building and running experiments with it
  - is hosted at https://github.com/phoeinx/PP-CRF
  - is mirrored at TODO
- The Helium repository
  - contains the code for the Helium system
  - is hosted at https://github.com/ChristianMct/helium
  - is mirrored at https://zenodo.org/doi/10.5281/zenodo.11045945
This section details the procedure for building and running the Completely Random Forest experiments.
The following software is required on the machine(s) running the experiments (see below for an automated way of setting up the machines). The version numbers are those used for the paper's results, but are only indicative.
The following Python packages are also required:
- docker, version 7.0.0
- paramiko, version 3.4.0
- pandas, version 2.0.0
- scikit-learn, version 1.3.0
Ansible can be used to set up all the above dependencies on SSH-accessible machines (the following instructions include the related command).
In this first part, we cover the steps to run a small-scale test experiment to demonstrate the process. We assume it is performed on a local machine on which the requirements are already set up. If you plan to work on a server directly, see the next part, which includes an automated setup over SSH.
- Clone the artifact repository:

  ```
  git clone https://github.com/phoeinx/PP-CRF && cd PP-CRF
  ```

- Build the experiment Docker image:

  ```
  make helium
  ```

- Run the experiment:

  ```
  python3 helium/exp_runner/main.py >> results
  ```
This last command runs the experiments for a grid of parameters and stores the results in ./results.
By default, the experiment and grid parameters represent a small set of experiments, for local test purposes.
Reproducing the results of the paper requires larger-scale experiments, which may require two servers.
For this part, we assume that the steps above have been performed on a local machine that has public-key SSH access to two servers
with host names <host1> and <host2>, and that <host1> has public-key SSH access to <host2>. In the steps below, <host1> will drive
the experiment and run the session nodes, while <host2> will run the helper "cloud".
- Set up the servers with Ansible:

  ```
  ansible-playbook -i <host1>,<host2> conf/ansible/setup_server.pb.yml
  ```

- SSH into <host1> and cd PP-CRF.
- Open the experiment runner script at helium/exp_runner/main.py.
- Change the Docker host name for the cloud:

  ```
  CLOUD_HOST = 'localhost'  =>  CLOUD_HOST = '<host2>'
  ```

- Run the experiment:

  ```
  python3 helium/exp_runner/main.py >> results
  ```
The current implementation only supports numerical datasets. Categorical features are not yet supported.
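Until categorical features are supported, a dataset can be made fully numerical before preprocessing by integer-encoding its categorical columns. A minimal sketch using scikit-learn's OrdinalEncoder (the column names and values below are illustrative, not taken from the repository's datasets):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative dataset with one categorical and one numerical column.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],  # categorical feature
    "size": [1.0, 2.5, 3.0, 0.5],              # numerical feature
})

# Map each category to an integer; categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2.
encoder = OrdinalEncoder()
df["color"] = encoder.fit_transform(df[["color"]]).ravel()

print(df["color"].tolist())  # [2.0, 1.0, 2.0, 0.0]
```

Note that ordinal encoding imposes an arbitrary order on the categories, which a tree-based model can tolerate but which may not match the model evaluated in the paper.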
The experiment grid is defined at the top of helium/exp_runner/main.py. The key parameters are:

```
# ====== Experiment Grid ======
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]  # number of federated parties
NUMBER_ESTIMATORS = [100]  # number of trees in the forest
TREE_DEPTH = [3]  # depth of each tree
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]  # probability a party skips a record
# ====== Cross-validation ======
N_FOLDS = 10  # number of stratified k-fold splits
FOLD_RANDOM_SEED = 0
# ====== Datasets ======
DATASETS = [
    "preprocessed_Breast Cancer Wisconsin (Original).csv",
    "preprocessed_MAGIC Gamma Telescope.csv",
    "preprocessed_TCGA.csv",
]
```

The SKIP_TO variable enables restarting from a specific experiment number in the grid if a run is interrupted.
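The number of runs grows as the product of the parameter lists. A sketch of how such a grid could be enumerated and how a SKIP_TO-style offset resumes an interrupted run; the enumeration order and the run_experiment call are illustrative, not the actual code in main.py:

```python
from itertools import product

# Illustrative subset of the grid (single-valued lists omitted; the
# actual runner may iterate additional parameters).
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
N_FOLDS = 10
DATASETS = ["bcw", "magic", "tcga"]  # placeholder names

grid = list(product(N_PARTIES, NON_PARTICIPATION_PROB, DATASETS, range(N_FOLDS)))
print(len(grid))  # 7 * 7 * 3 * 10 = 1470 experiments

SKIP_TO = 100  # resume from experiment number 100 after an interruption
for i, params in enumerate(grid):
    if i < SKIP_TO:
        continue
    # run_experiment(*params)  # placeholder for the actual runner call
```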
Each experiment produces a JSON line on stdout. The fields are:
- n_party: number of federated parties ($N$)
- threshold: cryptographic threshold ($T$), the number of parties required for decryption
- failure_rate: simulated node failure rate in failures/min
- failure_duration: mean failure duration in minutes
- rep: repetition index
- rate_limit: outgoing network rate limit applied to party containers
- delay: outgoing network delay applied to party containers
- TimeSetup: wall time in seconds for the MHE setup phase (key generation), measured at the cloud
- SentSetup: setup-phase outgoing traffic in MB at the cloud
- RecvSetup: setup-phase incoming traffic in MB at the cloud
- TimeCompute: wall time in seconds for the compute phase (all circuit evaluations), measured at the cloud
- SentCompute: compute-phase outgoing traffic in MB at the cloud
- RecvCompute: compute-phase incoming traffic in MB at the cloud
- theoretical_node_online: expected number of online nodes at equilibrium
- theoretical_time_above_thresh: expected fraction of time with at least $T$ nodes online
- actual_node_online: empirical average number of online nodes during the experiment
- actual_time_above_thresh: empirical fraction of time with at least $T$ nodes online
- dataset: name of the dataset used
- n_estimators: number of trees in the forest
- tree_depth: depth of each tree
- non_participation_prob: probability that a party skips contributing to a given record
- fold: cross-validation fold index
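Since each result is a single JSON object per line, the ./results file can be loaded directly into pandas for analysis. A minimal sketch; the two result lines below are illustrative, with only a few of the fields listed above:

```python
import io
import pandas as pd

# Two illustrative JSON-lines results (real lines carry all fields).
raw = io.StringIO(
    '{"n_party": 10, "threshold": 7, "TimeSetup": 12.3, "dataset": "tcga"}\n'
    '{"n_party": 50, "threshold": 34, "TimeSetup": 58.1, "dataset": "tcga"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df = pd.read_json(raw, lines=True)
print(df["n_party"].tolist())  # [10, 50]
```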
The experiment runner executes all experiments in the grid, outputting each result on a new line.
The configuration of the Helium nodes can be found in the genConfigForNode function of helium/app/main.go.
The current configuration has the following characteristics:
- FHE parameters: polynomial degree $2^{12}$, coefficient modulus of 109 bits (LogQ: [45, 45], LogP: [19]), plaintext modulus 79873.
- Protocol concurrency: up to 64 concurrent protocols per node (MaxProtoPerNode, MaxParticipation, MaxAggregation).
- Up to 16 concurrently running circuits (MaxCircuitEvaluation).
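The 109-bit figure is simply the sum of the prime sizes in LogQ and LogP. A quick sanity check of the parameter arithmetic (variable names mirror the config fields; this is an illustration, not code from the repository):

```python
LOG_N = 12                 # ring degree N = 2**12 = 4096
LOG_Q = [45, 45]           # ciphertext modulus prime sizes, in bits
LOG_P = [19]               # auxiliary (key-switching) prime size, in bits
PLAINTEXT_MODULUS = 79873

# Total coefficient modulus size is the sum of all prime sizes.
total_bits = sum(LOG_Q) + sum(LOG_P)
print(total_bits)          # 45 + 45 + 19 = 109 bits

assert 2 ** LOG_N == 4096
```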