Federated Completely Random Forests

This repository contains the federated, distributed implementation of the privacy-preserving tree-based model proposed in the paper "Privacy-preserving tree-based machine learning for data heterogenous and low-data regimes". The paper will be submitted soon.

This repository builds upon the helium-artifacts repository, which contains the artifacts of the paper "Helium: Scalable MPC among Lightweight Participants and under Churn" by Mouchet et al. The paper appeared at CCS 2024 and is available at https://eprint.iacr.org/2024/194. We thank Christian Mouchet and his co-authors for providing such well-documented, reproducible code and for allowing its reuse!

List of artifacts

  • The present artifact repository
    • imports the Helium repository at a pre-release version of v0.2.2
    • contains:
      • a Helium application implementing collaborative and privacy-preserving training of a completely random forest on tabular and horizontal federated data
      • scripts for building and running experiments with it.
    • hosted at https://github.com/phoeinx/PP-CRF
    • mirrored at TODO

Dependencies

Instructions

This section details the procedure for building and running the Completely Random Forest experiments.

Setup

The following software is required on the machine(s) running the experiments (see below for an automated way of setting up the machines). The version numbers are those used for the paper's results and are only indicative.

The following Python packages are also required:

  • docker, version 7.0.0
  • paramiko, version 3.4.0
  • pandas, version 2.0.0
  • scikit-learn, version 1.3.0

Ansible can be used to set up all the above dependencies on SSH-accessible machines (the following instructions include the related command).

Running locally

In this first part, we cover the steps to run a small-scale test experiment to demonstrate the process. We assume it is performed on a local machine on which the requirements are already set up. If you are planning to work on a server directly, please see the next part, as it includes an automated setup over SSH.

  1. Clone the artifact repository: git clone https://github.com/phoeinx/PP-CRF && cd PP-CRF
  2. Build the experiment Docker image: make helium
  3. Run the experiment: python3 helium/exp_runner/main.py >> results

This last command runs the experiments for a grid of parameters and appends their output to ./results. By default, the experiment and grid parameters define a small set of experiments for local testing. Reproducing the results of the paper requires larger-scale experiments, which may need two servers.

Running on two servers

For this part, we assume that the steps above have been performed on a local machine that has public-key SSH access to two servers with host names <host1> and <host2>, and that <host1> has public-key SSH access to <host2>. In the steps below, <host1> will drive the experiment and run the session nodes, while <host2> will run the helper "cloud".

  1. Set up the servers with Ansible: ansible-playbook -i <host1>,<host2> conf/ansible/setup_server.pb.yml
  2. SSH into <host1> and cd PP-CRF
  3. Open the experiment runner script at helium/exp_runner/main.py
  4. Change the docker host name for the cloud: CLOUD_HOST = 'localhost' => CLOUD_HOST = '<host2>'
  5. Run the experiment: python3 helium/exp_runner/main.py >> results

Limitations

The current implementation only supports numerical datasets. Categorical features are not yet supported.

Controlling the experiment parameters and grid

The experiment grid is defined at the top of helium/exp_runner/main.py. The key parameters are:

# ====== Experiment Grid ======
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]  # number of federated parties
NUMBER_ESTIMATORS = [100]                   # number of trees in the forest
TREE_DEPTH = [3]                            # depth of each tree
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]  # probability a party skips a record

# ====== Cross-validation ======
N_FOLDS = 10           # number of stratified k-fold splits
FOLD_RANDOM_SEED = 0

# ====== Datasets ======
DATASETS = [
    "preprocessed_Breast Cancer Wisconsin (Original).csv",
    "preprocessed_MAGIC Gamma Telescope.csv",
    "preprocessed_TCGA.csv",
]

The SKIP_TO variable enables restarting from a specific experiment number in the grid if a run is interrupted.
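
To get a feel for the size of the grid, the sketch below enumerates it as a Cartesian product and shows how a restart index could be applied. It is only an illustration, not the runner's actual code: the iteration order, the treatment of each fold as a separate experiment, the iter_experiments helper, and the reading of SKIP_TO as a 0-based index are all assumptions.

```python
from itertools import product

# Values copied from the grid shown above.
N_PARTIES = [2, 5, 10, 50, 100, 200, 300]
NUMBER_ESTIMATORS = [100]
TREE_DEPTH = [3]
NON_PARTICIPATION_PROB = [0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
N_FOLDS = 10
DATASETS = [
    "preprocessed_Breast Cancer Wisconsin (Original).csv",
    "preprocessed_MAGIC Gamma Telescope.csv",
    "preprocessed_TCGA.csv",
]

# Assumption: SKIP_TO is the 0-based index of the first experiment to run.
SKIP_TO = 0


def iter_experiments(skip_to=SKIP_TO):
    """Yield (index, params) for every grid point at or after skip_to."""
    grid = product(DATASETS, N_PARTIES, NUMBER_ESTIMATORS, TREE_DEPTH,
                   NON_PARTICIPATION_PROB, range(N_FOLDS))
    for index, (dataset, n_party, n_est, depth, p_skip, fold) in enumerate(grid):
        if index < skip_to:
            continue  # already completed in an earlier, interrupted run
        yield index, {
            "dataset": dataset,
            "n_party": n_party,
            "n_estimators": n_est,
            "tree_depth": depth,
            "non_participation_prob": p_skip,
            "fold": fold,
        }


experiments = list(iter_experiments())
print(len(experiments))  # 3 * 7 * 1 * 1 * 7 * 10 = 1470
```

Under these assumptions the grid above amounts to 1470 runs, which is why restarting from a given index is useful after an interruption.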

Result format

Each experiment produces a JSON line on stdout. The fields are:

  • n_party: number of federated parties ($N$)
  • threshold: cryptographic threshold ($T$) — number of parties required for decryption
  • failure_rate: simulated node failure rate in failures/min
  • failure_duration: mean failure duration in minutes
  • rep: repetition index
  • rate_limit: outgoing network rate limit applied to party containers
  • delay: outgoing network delay applied to party containers
  • TimeSetup: wall time in seconds for the MHE setup phase (key generation), measured at the cloud
  • SentSetup: setup-phase outgoing traffic in MB at the cloud
  • RecvSetup: setup-phase incoming traffic in MB at the cloud
  • TimeCompute: wall time in seconds for the compute phase (all circuit evaluations), measured at the cloud
  • SentCompute: compute-phase outgoing traffic in MB at the cloud
  • RecvCompute: compute-phase incoming traffic in MB at the cloud
  • theoretical_node_online: expected number of online nodes at equilibrium
  • theoretical_time_above_thresh: expected fraction of time with at least $T$ nodes online
  • actual_node_online: empirical average number of online nodes during the experiment
  • actual_time_above_thresh: empirical fraction of time with at least $T$ nodes online
  • dataset: name of the dataset used
  • n_estimators: number of trees in the forest
  • tree_depth: depth of each tree
  • non_participation_prob: probability that a party skips contributing to a given record
  • fold: cross-validation fold index
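
One way to read the theoretical_node_online and theoretical_time_above_thresh fields is as steady-state quantities of a simple availability model. The sketch below assumes that each node independently alternates between online periods of mean length 1/failure_rate minutes and offline periods of mean length failure_duration minutes. This model and the function names are assumptions made for illustration; the runner may compute these values differently.

```python
from math import comb


def online_probability(failure_rate, failure_duration):
    """Steady-state probability that one node is online.

    Mean up-time is 1/failure_rate and mean down-time is failure_duration,
    so the online fraction is (1/rate) / (1/rate + duration).
    """
    return 1.0 / (1.0 + failure_rate * failure_duration)


def expected_online(n_party, failure_rate, failure_duration):
    """Expected number of online nodes at equilibrium."""
    return n_party * online_probability(failure_rate, failure_duration)


def fraction_time_above_threshold(n_party, threshold, failure_rate, failure_duration):
    """Probability that at least `threshold` of `n_party` nodes are online.

    Assumes independent nodes, so the online count is binomial.
    """
    p = online_probability(failure_rate, failure_duration)
    return sum(
        comb(n_party, k) * p**k * (1.0 - p) ** (n_party - k)
        for k in range(threshold, n_party + 1)
    )


# Two nodes, each online half of the time, threshold of one node.
print(expected_online(2, 1.0, 1.0))                   # 1.0
print(fraction_time_above_threshold(2, 1, 1.0, 1.0))  # 0.75
```

With a failure rate of 0, every node is always online, so the expected count equals n_party and the fraction of time above any threshold is 1.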

The experiment runner executes all experiments in the grid, outputting each result on a new line.
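
Because the output is one JSON object per line, it can be post-processed with the standard library alone. The following is a minimal sketch, assuming that any line not starting with "{" can be ignored; the load_results and mean_by helpers and the sample values are made up for illustration and are not part of the repository.

```python
import json
from collections import defaultdict


def load_results(lines):
    """Parse JSON-lines output, ignoring blank or non-JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip truncated lines, e.g. from an interrupted run
    return records


def mean_by(records, key, value):
    """Average the field `value` over all records sharing the same `key`."""
    groups = defaultdict(list)
    for record in records:
        if key in record and value in record:
            groups[record[key]].append(record[value])
    return {k: sum(v) / len(v) for k, v in groups.items()}


# Made-up sample lines using field names from the list above.
sample = [
    '{"n_party": 2, "fold": 0, "TimeCompute": 10.0}',
    '{"n_party": 2, "fold": 1, "TimeCompute": 14.0}',
    "some log line",
    '{"n_party": 5, "fold": 0, "TimeCompute": 30.0}',
]
records = load_results(sample)
print(mean_by(records, "n_party", "TimeCompute"))  # {2: 12.0, 5: 30.0}
```

To process a real run, pass an open file instead of the sample list, for example load_results(open("results")).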

Further configuration of Helium

The configuration of the Helium nodes can be found in the genConfigForNode function of helium/app/main.go. The current configuration has the following characteristics:

  • FHE parameters: polynomial degree $2^{12}$, coefficient modulus of 109 bits (LogQ: [45, 45], LogP: [19]), plaintext modulus 79873.
  • Protocol concurrency: up to 64 concurrent protocols per node (MaxProtoPerNode, MaxParticipation, MaxAggregation).
  • Up to 16 concurrently running circuits (MaxCircuitEvaluation).
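
As a quick sanity check on the FHE parameters listed above, the 109-bit figure is the sum of the LogQ and LogP entries. The snippet below only redoes this arithmetic in Python; the variable names are illustrative and do not correspond to identifiers in helium/app/main.go.

```python
# Values copied from the parameter description above.
LOG_N = 12                 # log2 of the polynomial degree
LOG_Q = [45, 45]           # bit sizes listed under LogQ
LOG_P = [19]               # bit sizes listed under LogP
PLAINTEXT_MODULUS = 79873

ring_degree = 2 ** LOG_N
modulus_bits = sum(LOG_Q) + sum(LOG_P)

print(ring_degree)                     # 4096
print(modulus_bits)                    # 109
print(PLAINTEXT_MODULUS.bit_length())  # 17
```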
