feat: reconcile config with runtime distributed state by Jayce-Ping · Pull Request #162 · X-GenGroup/Flow-Factory

Jayce-Ping · 2026-05-22T06:25:25Z

Summary

Extract shared env var mappings into utils/env_utils.py as a single source of truth (previously duplicated in cli.py and a now-removed logger/runtime_config.py)
Add reconcile_config() that patches the Arguments dataclass in-place with actual runtime values from Accelerator + cluster env vars
Call it once in load_trainer() after Accelerator creation — before any consumer (model adapter, trainer, logger) reads config — so config.to_dict() everywhere reflects the true distributed state (num_processes, process_index, num_machines, gpus_per_node, etc.)
Logger module remains completely decoupled from accelerate

Test plan

Verify from flow_factory.utils.env_utils import reconcile_config imports cleanly
Verify from flow_factory.logger import load_logger works without accelerate at import time
Verify Arguments().to_dict() excludes runtime fields when they are None
Run single-node training and confirm W&B/SwanLab/TensorBoard config shows correct num_processes
Run multi-node training and confirm env-derived fields (num_machines, gpus_per_node, machine_rank, main_process_ip) appear in logged config

🤖 Generated with Claude Code

…ted state Extract shared env var mappings into utils/env_utils.py (single source of truth for cli.py and runtime reconciliation) and add a reconcile_config() function that patches the Arguments dataclass in-place with actual runtime values from the Accelerator and cluster environment variables. Called once in load_trainer() after Accelerator creation — before any consumer (model adapter, trainer, logger) reads config — so that config.to_dict() everywhere reflects the true distributed state (num_processes, process_index, num_machines, gpus_per_node, etc.). The logger module remains completely decoupled from accelerate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR centralizes multi-node environment variable handling and ensures the runtime Arguments config reflects the actual distributed state after Accelerator initialization, so downstream consumers (especially logging) observe consistent, correct values.

Changes:

Added utils/env_utils.py with shared env-var mapping/lookup and a reconcile_config() helper to patch runtime distributed fields into Arguments.
Updated load_trainer() to call reconcile_config() immediately after creating the Accelerator.
Updated CLI and multinode docs to rely on the shared env-var mapping utilities and document runtime reconciliation behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/flow_factory/utils/env_utils.py`	Introduces shared env var mapping + `reconcile_config()` to sync config with runtime distributed metadata.
`src/flow_factory/trainers/loader.py`	Calls `reconcile_config()` after `Accelerator` creation so subsequent components read reconciled values.
`src/flow_factory/hparams/args.py`	Adds runtime distributed fields (process indices, machine metadata, master IP) to `Arguments`.
`src/flow_factory/cli.py`	Switches CLI env detection to use shared `ENV_VAR_MAPPINGS` / `env_lookup`.
`multinode_examples/README.md`	Documents that configs are reconciled at runtime for accurate logged metadata.

Comments suppressed due to low confidence (3)

src/flow_factory/utils/env_utils.py:93

When num_machines is set, gpus_per_node is computed via integer division (num_processes // num_machines) without validating that num_processes is divisible by num_machines. If the values ever disagree, this will silently truncate and log incorrect metadata; consider checking for a remainder and either raising/logging, or preferring an explicit gpus_per_node env var when present.

            num_machines = int(num_machines_str)
            config.num_machines = num_machines
            if num_machines > 0:
                config.gpus_per_node = accelerator.num_processes // num_machines
        except ValueError:

src/flow_factory/utils/env_utils.py:80

Invalid integer values in cluster env vars are silently ignored (except ValueError: pass). This makes misconfiguration hard to debug and can leave defaults in place without any indication; consider logging a warning (including the env var name/value) when parsing fails.

    port_str = env_lookup("master_port")
    if port_str is not None:
        try:
            config.main_process_port = int(port_str)
        except ValueError:
            pass

src/flow_factory/cli.py:48

num_machines_str comes from env vars and is cast with int(...) without any error handling. If a scheduler sets a non-integer value, the CLI will crash before printing a helpful message; consider wrapping this parse (and the other int(...) parses in this function) with a try/except that logs a clear error and falls back to {}.

    master_ip = env_lookup("master_ip")
    num_machines_str = env_lookup("num_machines")

    # Consider it a multi-node environment only when both master_ip and num_machines > 1 are present
    if not master_ip or not num_machines_str:
        return {}

    num_machines = int(num_machines_str)
    if num_machines <= 1:
        return {}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    # Authoritative values from accelerator
+    config.num_processes = accelerator.num_processes
+    config.mixed_precision = accelerator.mixed_precision  # type: ignore[assignment]
+    config.process_index = accelerator.process_index
+    config.local_process_index = accelerator.local_process_index


…pe: ignore Address review feedback: replace `type: ignore[assignment]` with a runtime check against the declared Literal values. If Accelerate ever returns an unexpected string, the original config value is preserved rather than silently overwritten. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ad of type: ignore" This reverts commit 9bb09f6.

Brings trellis2 branch up-to-date with origin/main, integrating: - PR X-GenGroup#170: DiffusionOPD on-policy distillation trainer - PR X-GenGroup#168: Multi-dataset training with per-source reward routing - PR X-GenGroup#165: GenEval reward + unified trainer sampling pipeline - PR X-GenGroup#163: hparams/training_args.py → package split - PR X-GenGroup#162: Reconcile config with runtime distributed state - PRs X-GenGroup#121,X-GenGroup#146-X-GenGroup#161: CRD algorithm, Docker CUDA 12.9, HF resume, etc. Conflict resolutions: - trainers/registry.py: merged both sides (trellis2_grpo/nft + crd/opd) - rewards/registry.py: merged both sides (trellis2 rewards + geneval) - hparams/training_args.py: deleted (accept main's package split), added trellis2_grpo/nft entries to _registry.py - trainers/grpo.py: removed duplicate evaluate() (unified in BaseTrainer) - trainers/nft.py: adopted generate_samples(), removed duplicate evaluate() - trainers/abc.py: added _extra_eval_inference_kwargs() hook to BaseTrainer evaluate() so Trellis2TrainerMixin can inject stages/render_kwargs - samples/samples.py: merged source/source_id fields with trellis2 additions - rewards/reward_processor.py: merged source-aware gating with trellis2's _store_reward_extra_info - data_utils/loader.py: merged multi-source pipeline with image_3d dataset - data_utils/sampler_loader.py: merged refactored parameters - hparams/args.py: merged multi-source alignment with trellis2's group-aligned - guidance/rewards.md: merged both (PickScore_TextImage_Sum + GenEval) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 22, 2026 06:25

Copilot started reviewing on behalf of Jayce-Ping May 22, 2026 06:25 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Jayce-Ping and others added 2 commits May 22, 2026 14:34

Revert "[utils] fix: validate mixed_precision before assignment inste…

5fb65c8

…ad of type: ignore" This reverts commit 9bb09f6.

Jayce-Ping merged commit 3ac3cf2 into main May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: reconcile config with runtime distributed state#162

feat: reconcile config with runtime distributed state#162
Jayce-Ping merged 3 commits into
mainfrom
feat/reconcile-runtime-config

Jayce-Ping commented May 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Jayce-Ping commented May 22, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants