Cosmos: Add *_OVERRIDE incident kill-switch env vars for hedging & PPCB, plus AZURE_COSMOS_HEDGING_ENABLED master switch#4562
Introduces OperationOptions::hedging_enabled (env AZURE_COSMOS_HEDGING_ENABLED) to enable/disable cross-region read hedging, enabled by default. The env switch is the source of truth and takes precedence over the programmatic AvailabilityStrategy in both directions: false disables hedging even with an explicit Hedging(..), and true enables hedging even with an explicit Disabled (a programmatic Hedging(..) still supplies its custom threshold). Default threshold min(1000ms, request_timeout/2) is unchanged. Adds unit tests and updates CHANGELOGs and ConfigurationOptions.md.
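The precedence described above can be sketched as a small pure function. This is a minimal illustration, not the driver's code: the `AvailabilityStrategy` enum shape, the `resolve_hedging` name, and the "no switch and no strategy means default-on" branch are assumptions made for the example; only the precedence rules and the default threshold `min(1000ms, request_timeout/2)` come from the description.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the driver's programmatic strategy type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AvailabilityStrategy {
    Disabled,
    Hedging { threshold: Duration },
}

/// Default threshold from the description: min(1000 ms, request_timeout / 2).
pub fn default_threshold(request_timeout: Duration) -> Duration {
    Duration::from_millis(1000).min(request_timeout / 2)
}

/// Returns the hedging threshold to use, or `None` when hedging is off.
pub fn resolve_hedging(
    hedging_enabled: Option<bool>,
    strategy: Option<AvailabilityStrategy>,
    request_timeout: Duration,
) -> Option<Duration> {
    // A programmatic Hedging(..) still supplies its custom threshold.
    let custom = match strategy {
        Some(AvailabilityStrategy::Hedging { threshold }) => Some(threshold),
        _ => None,
    };
    match hedging_enabled {
        // Switch off: wins even over an explicit Hedging(..).
        Some(false) => None,
        // Switch on: wins even over an explicit Disabled.
        Some(true) => Some(custom.unwrap_or_else(|| default_threshold(request_timeout))),
        // No switch value: defer to the programmatic strategy
        // (assumed default-on when no strategy is given).
        None => match strategy {
            Some(AvailabilityStrategy::Disabled) => None,
            _ => Some(custom.unwrap_or_else(|| default_threshold(request_timeout))),
        },
    }
}
```

With a 6 s request timeout, `resolve_hedging(Some(true), Some(AvailabilityStrategy::Disabled), ..)` yields the 1000 ms default, while `Some(false)` yields `None` regardless of the strategy.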
…r_hedging_enablement
…ement
Adds a generic, opt-in {ENV}_OVERRIDE kill switch to the CosmosOptions derive
macro and applies it to the two enablement env vars (AZURE_COSMOS_HEDGING_ENABLED,
AZURE_COSMOS_PER_PARTITION_CIRCUIT_BREAKER_ENABLED). When set, an override wins
over every layer (including a per-request value and, for hedging, a programmatic
AvailabilityStrategy), acting as a fleet-wide incident kill switch that needs no
code change or redeploy. Resolution order becomes:
{ENV}_OVERRIDE -> operation -> account -> runtime -> {ENV}. Overrides are inert
unless set.
Macro: new `#[option(env = "...", overridable)]` attribute (parse.rs, requires
env); env.rs generates from_env_override()/from_env_override_vars() reading the
_OVERRIDE variants via a shared field-init helper; view.rs conditionally adds a
highest-priority env_override layer and a new_with_override() constructor while
keeping new() as a backward-compatible delegating wrapper.
Driver: mark hedging_enabled and per_partition_circuit_breaker_enabled
overridable; runtime builds env_override_operation_options from
from_env_override() with an accessor; both production OperationOptionsView
constructions use new_with_override().
Also bumps the driver's azure_data_cosmos_macros dependency to the local
workspace 0.2.0 via a path dependency (it was resolving the published 0.1.0 from
crates.io, so the local macro source was not being consumed).
Adds macro codegen/validation tests and driver override-resolution tests;
updates ConfigurationOptions.md and both CHANGELOGs.
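The resolution order in this commit (`{ENV}_OVERRIDE -> operation -> account -> runtime -> {ENV}`) amounts to "first layer that is set wins". A minimal sketch, assuming each layer holds an `Option<bool>`; the `LayeredFlag` struct and its field names are hypothetical and are not the macro-generated view type:

```rust
/// Hypothetical view over one optional boolean setting across all layers.
#[derive(Debug, Default, Clone, Copy)]
pub struct LayeredFlag {
    pub env_override: Option<bool>, // {ENV}_OVERRIDE, highest priority
    pub operation: Option<bool>,
    pub account: Option<bool>,
    pub runtime: Option<bool>,
    pub env: Option<bool>, // plain {ENV}, lowest priority
}

impl LayeredFlag {
    /// First layer with a value wins; an unset override is inert.
    pub fn resolve(&self) -> Option<bool> {
        self.env_override
            .or(self.operation)
            .or(self.account)
            .or(self.runtime)
            .or(self.env)
    }
}
```

Because an unset layer is simply `None`, adding the override layer does not change resolution for deployments that never set the `_OVERRIDE` variable.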
Makes the CosmosOptions derive macro the single env-reading mechanism across the driver. ConnectionPoolOptions, DiagnosticsOptions, and the runtime CPU-refresh interval previously hand-rolled their AZURE_COSMOS_* reads via the parse_*_from_env functions in env_parsing.rs; they now declare a small internal `#[derive(CosmosOptions)] #[options(layers(runtime))]` env-config struct (the same machinery OperationOptions uses), read it once via the generated from_env(), and resolve `builder -> env -> default` with shared validation helpers.

env_parsing.rs is now a resolution + validation module: the env-reading functions (parse_from_env, parse_optional_from_env, parse_duration_millis_from_env, parse_optional_duration_millis_from_env) are replaced by resolve_from_env, resolve_optional_from_env, resolve_duration_ms, and resolve_optional_duration_ms, which take the pre-read env value (parsed by the macro) and only perform resolution + bounds validation. No std::env::var calls remain outside the macro.

Public resolved types (ConnectionPoolOptions, DiagnosticsOptions) and their getters are unchanged. All existing bounds / cross-field validation and error messages are preserved.

Behavior note: a malformed env value (e.g. a non-numeric *_MS timeout) is now logged and ignored -- falling back to the default -- instead of failing runtime construction; bounds violations on a present value still hard-error as before. Documented in the driver CHANGELOG.

Validated: cargo fmt / build --all-targets / clippy clean; 1776 driver tests pass; cargo doc -D warnings clean; dependent azure_data_cosmos SDK builds.
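The `builder -> env -> default` resolution and the difference between a malformed value (ignored) and an out-of-bounds value (hard error) can be sketched as follows. This is an illustration under assumptions: the function name, its signature, and the error text are hypothetical, and for brevity it parses the raw string itself, whereas the commit describes the macro doing the parsing and the `resolve_*` helpers receiving the pre-read value.

```rust
use std::time::Duration;

/// Hypothetical sketch: resolve a millisecond setting as builder -> env -> default.
pub fn resolve_duration_ms_sketch(
    builder: Option<Duration>,
    env_raw: Option<&str>,
    default: Duration,
    min: Duration,
    max: Duration,
    name: &str,
) -> Result<Duration, String> {
    // Malformed env text is reported and ignored (fail-soft).
    let env = env_raw.and_then(|raw| match raw.trim().parse::<u64>() {
        Ok(ms) => Some(Duration::from_millis(ms)),
        Err(_) => {
            eprintln!("ignoring malformed {name}={raw:?}");
            None
        }
    });
    // The builder value wins over env, env wins over the default.
    let value = builder.or(env).unwrap_or(default);
    // A value that parsed but is out of bounds is still a hard error.
    if value < min || value > max {
        return Err(format!("{name} must be within {min:?}..={max:?}, got {value:?}"));
    }
    Ok(value)
}
```

Under these assumptions, an env value of `"abc"` falls back to the default, while `"5"` with a 100 ms minimum returns an error.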
…r_hedging_enablement
Title changed: "AZURE_COSMOS_HEDGING_ENABLED env var to enable/disable cross-region read hedging (default on)" → "*_OVERRIDE incident kill-switch env vars for hedging & PPCB, plus AZURE_COSMOS_HEDGING_ENABLED master switch"
CI flagged 'livesite' (kill-switch wording in CHANGELOG.md and ConfigurationOptions.md) and 'unparseable' (connection_pool.rs comment) introduced by the _OVERRIDE / env-consolidation work. Added both to sdk/cosmos/.cspell.json ignoreWords; verified clean with eng/common/spelling/Invoke-Cspell.ps1.
…spec updates, test coverage)
Robustness (fail-open kill switch):
- Parse boolean env options leniently in the CosmosOptions derive macro so a
kill switch does not silently fail open on a common spelling. Accepts
true/false, 1/0, yes/no, on/off (case-insensitive) for both the base and
_OVERRIDE variants; unrecognized values are logged and ignored.
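A lenient boolean parser of the kind described in the Robustness item could look like the sketch below. The function name is hypothetical, and trimming surrounding whitespace is an assumption not stated in the commit; the accepted spellings and the case-insensitivity are taken from the text.

```rust
/// Hypothetical sketch of lenient boolean parsing for env values.
/// Returns `None` for unrecognized input so the caller can log and ignore it.
pub fn parse_lenient_bool(raw: &str) -> Option<bool> {
    // Trimming is an assumption; case-insensitivity is from the commit text.
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" | "1" | "yes" | "on" => Some(true),
        "false" | "0" | "no" | "off" => Some(false),
        _ => None,
    }
}
```

Returning `None` instead of an error keeps an unrecognized spelling from failing runtime construction, which matches the logged-and-ignored behavior the commit describes.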
Docs (stale spec / contract):
- HEDGING_SPEC.md §4.4/§11.3/§11.3.1: document the AZURE_COSMOS_HEDGING_ENABLED
master switch and its _OVERRIDE kill switch, add the Priority 0 layer and the
{ENV}_OVERRIDE resolution chain, and remove the now-false "no env opt-out" /
"explicit opt-out always wins" statements.
- PARTITION_LEVEL_FAILOVER_SPEC.md: update the stale init sample to
new_with_override(...) matching the real CosmosDriver construction.
- ConfigurationOptions.md: note that the env vars and their _OVERRIDE variants
are read once at runtime-build time, the deliberate env-beats-code inversion
vs .NET/Java, and the lenient boolean parsing.
Tests:
- Add lenient-bool macro integration tests (common spellings, unparseable
ignored, _OVERRIDE leniency).
- Add end-to-end env_override resolution tests for hedging
(resolve_availability_strategy) and PPCB (PartitionFailoverConfig::from_options).
- Add from_env_vars name-mapping and leniency tests for the connection-pool,
diagnostics, and runtime env-config structs.
…ulator e2e test
- Retarget the 5 CHANGELOG entries added by this PR from #4432 to #4562 (SDK and driver crates).
- Add emulator E2E test (driver_hedging_kill_switch.rs) verifying AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE flows through the real runtime build (lenient OFF -> Some(false)) and keeps the live create/read data path healthy.
- Add DriverTestRunContext::runtime() accessor to inspect the runtime env-override layer.
Pull request overview
This PR adds environment-driven enablement controls to Cosmos availability features (cross-region read hedging and the per-partition circuit breaker), including an incident-oriented {ENV}_OVERRIDE kill-switch layer that overrides all other configuration sources.
Changes:
- Extended `CosmosOptions` derive to support `#[option(env = "...", overridable)]`, generating `{ENV}_OVERRIDE` parsing + a top-priority `env_override` view layer (via `new_with_override`).
- Introduced `OperationOptions::{hedging_enabled, per_partition_circuit_breaker_enabled}` env master/override switches and updated hedging resolution to evaluate the env switch before `availability_strategy`.
- Consolidated env parsing for connection pool, diagnostics, and runtime refresh interval onto macro-generated `from_env()`, switching malformed env values to warn-and-ignore (while keeping bounds violations as errors), plus added unit/E2E coverage and docs/changelog updates.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos/docs/ConfigurationOptions.md | Documents new master/override env vars and precedence. |
| sdk/cosmos/azure_data_cosmos/CHANGELOG.md | Records SDK-facing env kill switch + master switch additions. |
| sdk/cosmos/azure_data_cosmos_macros/tests/derive_cosmos_options.rs | Adds tests for lenient boolean parsing (base + override). |
| sdk/cosmos/azure_data_cosmos_macros/src/view.rs | Adds optional env_override layer + new_with_override constructor generation. |
| sdk/cosmos/azure_data_cosmos_macros/src/parse.rs | Parses new overridable attribute and derives {ENV}_OVERRIDE names. |
| sdk/cosmos/azure_data_cosmos_macros/src/lib.rs | Documents the new overridable option attribute behavior. |
| sdk/cosmos/azure_data_cosmos_macros/src/env.rs | Implements shared env parsing, including lenient bool parsing and override constructors. |
| sdk/cosmos/azure_data_cosmos_driver/tests/framework/test_client.rs | Exposes runtime for tests that need to inspect runtime-resolved env override state. |
| sdk/cosmos/azure_data_cosmos_driver/tests/emulator_tests/mod.rs | Registers new emulator E2E test module. |
| sdk/cosmos/azure_data_cosmos_driver/tests/emulator_tests/driver_hedging_kill_switch.rs | Adds emulator E2E wiring test for AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/operation_options.rs | Adds overridable hedging/PPCB enablement fields + related tests. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/mod.rs | Updates re-export to renamed env resolution helper (resolve_duration_ms). |
| sdk/cosmos/azure_data_cosmos_driver/src/options/env_parsing.rs | Refactors env helpers to “resolve pre-read env value” model; removes direct env reads. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/diagnostics_options.rs | Migrates diagnostics env parsing to macro-driven config + resolve helpers. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/connection_pool.rs | Migrates connection-pool env parsing to macro-driven config + resolve helpers (fail-soft malformed env). |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs | Builds and exposes env_override_operation_options and uses macro-based runtime env config. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/partition_endpoint_state.rs | Adds tests verifying PPCB override precedence through resolution. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/hedging_eligibility.rs | Adds hedging_enabled Priority-0 logic and tests for override precedence/threshold behavior. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs | Wires env_override into init-time and per-operation view construction. |
| sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md | Updates spec sample to include env_override layer in view construction. |
| sdk/cosmos/azure_data_cosmos_driver/docs/HEDGING_SPEC.md | Updates spec to reflect new env master/override switches and precedence. |
| sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md | Records driver-side kill switch + env parsing consolidation behavior change. |
| sdk/cosmos/azure_data_cosmos_driver/Cargo.toml | Switches macros dependency to local path at v0.2.0. |
| sdk/cosmos/.cspell.json | Adds new terms used in docs/comments. |
| Cargo.lock | Updates lockfile for macros crate version/source change. |
- ConfigurationOptions.md: clarify that hedging_enabled=None means 'no env override' (defers to the programmatic AvailabilityStrategy), not unconditionally enabled.
- driver_hedging_kill_switch.rs: use a panic-safe RAII EnvVarGuard to restore AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE on drop, preventing the override from leaking into other tests if an assertion panics.
- macros env.rs/view.rs: tighten substring test assertions (fn from_env_override (), fn new () ) so they no longer match the longer *_vars / new_with_override names.
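A panic-safe RAII guard like the one mentioned for driver_hedging_kill_switch.rs could be sketched as below. This is not the test's actual `EnvVarGuard`; the struct layout and the `set` constructor are assumptions. The `unsafe` blocks are there because `std::env::set_var` and `remove_var` are `unsafe fn` in the 2024 edition; on earlier editions they are redundant, hence the `allow`.

```rust
use std::env;
use std::ffi::OsString;

/// Hypothetical guard: sets an env var and restores the prior state on drop.
pub struct EnvVarGuard {
    key: String,
    previous: Option<OsString>,
}

impl EnvVarGuard {
    #[allow(unused_unsafe)]
    pub fn set(key: &str, value: &str) -> Self {
        // Remember the previous value (if any) before overwriting it.
        let previous = env::var_os(key);
        // Caller must ensure no other thread touches the environment concurrently.
        unsafe { env::set_var(key, value) };
        Self { key: key.to_owned(), previous }
    }
}

impl Drop for EnvVarGuard {
    #[allow(unused_unsafe)]
    fn drop(&mut self) {
        // Drop also runs during unwinding, so a failed assertion
        // still restores the environment.
        unsafe {
            match &self.previous {
                Some(old) => env::set_var(&self.key, old),
                None => env::remove_var(&self.key),
            }
        }
    }
}
```

A test would bind the guard to a named local (`let _guard = EnvVarGuard::set(..)`) so it lives until the end of the scope; binding to a bare `_` would drop it immediately.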
analogrelay left a comment
Good start. Some fairly minor design suggestions. Not critically blocking, but this is a good opportunity to fix up all the Options layering (i.e. not just OperationOptions, but also DriverOptions, ConnectionPoolOptions, etc.) to use the same macro.
Address @analogrelay review: the new hedging/PPCB CHANGELOG entries were double-spaced; make them single-spaced to match the surrounding entries in both the SDK and driver CHANGELOGs.
…r_hedging_enablement
The two from_options_env_override_* tests referenced OperationOptions::per_partition_circuit_breaker_enabled and PartitionFailoverConfig::from_options, both removed by #4588 when PPCB enablement moved to driver-level PartitionFailoverOptions. They were dead code carried in by the main merge and broke 'cargo clippy --all-targets' (lib test) compilation.
…view)

Addresses analogrelay review comment on RuntimeEnvConfig: 'Might want to align this with changes coming in #4588.'

Removes the bespoke RuntimeEnvConfig CosmosOptions macro struct and routes AZURE_COSMOS_CPU_REFRESH_INTERVAL_MS through the shared parse_duration_millis_from_env helper, matching how #4588's driver-level PartitionFailoverOptions builder reads its duration env vars. All duration env vars now resolve on one path.

Drops the now-obsolete RuntimeEnvConfig field-mapping unit test (the helper is covered by env_parsing tests) and updates the driver CHANGELOG to reflect the actual consolidation scope.
…ly macro mode (PR #4562 review)

Addresses analogrelay review comment on DiagnosticsEnvConfig: 'Again, could this be combined into DiagnosticsOptions?'

Adds a struct-level #[options(env_only)] mode to the CosmosOptions derive that emits only from_env()/from_env_vars() (no View, Builder, or Default), letting an existing builder type double as its own env-var source. DiagnosticsOptionsBuilder now derives it directly, removing the separate DiagnosticsEnvConfig struct.

Includes macro integration tests (env_only coexists with a hand-written derive(Default)) and CHANGELOG entries.
…Options (PR #4562)

The #4588 merge moved PPCB enablement off OperationOptions onto driver-level PartitionFailoverOptions, which dropped this PR's PPCB _OVERRIDE kill switch while the changelogs/docs still advertised it. This restores the feature on its new home.

Adds PartitionFailoverOptions::circuit_breaker_enabled_override (env AZURE_COSMOS_PPCB_ENABLED_OVERRIDE, lenient boolean via new parse_optional_bool_from_env helper) plus the matching builder setter. The override is authoritative over BOTH the circuit_breaker_enabled option and the account property enable_per_partition_failover_behavior, applied at the two effective-PPCB resolution sites (PartitionEndpointState::new and LocationStateStore account-property refresh).

Updates both CHANGELOGs and ConfigurationOptions.md to the correct env-var name/scope, and adds unit tests covering both override directions at the options and routing-state layers.
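"Authoritative over both" can be read as: when the override is set, its value is the answer; otherwise the existing resolution applies untouched. A minimal sketch under that reading follows. The function name is hypothetical, and how the option and the account property combine when no override is set is not spelled out in this commit, so that result is taken as an input rather than guessed.

```rust
/// Hypothetical sketch: apply the PPCB override on top of whatever the
/// existing option + account-property resolution produced.
pub fn effective_ppcb_enabled(
    override_from_env: Option<bool>,
    resolved_without_override: bool,
) -> bool {
    // A set override wins in both directions; an unset one is inert.
    override_from_env.unwrap_or(resolved_without_override)
}
```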
…ro parser attribute (PR #4562 review)

Addresses analogrelay review comments: "Does this have to be separate from ConnectionPoolOptions?" (ConnectionPoolEnvConfig) and the suggestion to "add the ability to specify a parser function in the #[option(env, ...)] attribute" (env.rs field_init).

Adds a #[option(env = "...", parser = path)] attribute to the CosmosOptions derive that parses an env var with a custom fn(&str) returning Option<T> instead of FromStr, supporting types like Duration (from a millisecond count) and the emulator cert-validation bool mapped to ServerCertificateValidation.

ConnectionPoolOptionsBuilder now derives #[options(env_only)] with per-field env + parser attributes and doubles as its own env source, removing the separate ConnectionPoolEnvConfig struct. build() merges builder-or-env per field and reuses the existing resolve_* helpers, preserving the exact default / bounds / cross-field-validation semantics.

Adds macro unit + integration tests for the parser attribute and updates the connection-pool env tests. CHANGELOGs updated.
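A parser function of the shape this attribute expects, `fn(&str) -> Option<T>`, might look like the following sketch for a millisecond-count `Duration`. Both function names are hypothetical; `read_with_parser` only imitates what generated code could do with a raw env string and is not the macro's output.

```rust
use std::time::Duration;

/// Hypothetical parser usable as `parser = parse_duration_millis`:
/// a millisecond count becomes a `Duration`, anything else is `None`.
pub fn parse_duration_millis(raw: &str) -> Option<Duration> {
    raw.trim().parse::<u64>().ok().map(Duration::from_millis)
}

/// Imitation of the generated field initializer: apply the custom parser
/// to the raw env value, treating a `None` result as "not set".
pub fn read_with_parser<T>(raw: Option<&str>, parser: fn(&str) -> Option<T>) -> Option<T> {
    raw.and_then(parser)
}
```

Returning `Option<T>` rather than `Result` lets one function signature cover types without a suitable `FromStr` impl while keeping the ignore-on-failure behavior.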
Pull request overview
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
sdk/cosmos/azure_data_cosmos_macros/src/parse.rs:114
`#[options(env_only)]` is documented as mutually exclusive with `#[options(layers(...))]`, but the parser currently allows both and will silently ignore `layers(...)` because `env_only` short-circuits codegen. This should be rejected at macro-parse time to avoid surprising/ambiguous behavior.
    let (layers, env_only) = parse_options_attr(&ast.attrs)?;
    if !env_only && layers.is_empty() {
        return Err(Error::new(
            ast.ident.span(),
            "missing `#[options(layers(...))]` attribute",
        ));
    }
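The check this comment asks for can be sketched without the proc-macro machinery. The version below is a dependency-free illustration, not the crate's parser: the function name is hypothetical, layers are modeled as plain strings, and the first error message is invented for the example, whereas the real code would return a spanned `syn::Error`.

```rust
/// Hypothetical validation of the struct-level `#[options(...)]` attribute.
pub fn validate_options_attr(layers: &[&str], env_only: bool) -> Result<(), String> {
    // Reject the ambiguous combination instead of silently ignoring `layers(...)`.
    if env_only && !layers.is_empty() {
        return Err(
            "`#[options(env_only)]` cannot be combined with `#[options(layers(...))]`".to_owned(),
        );
    }
    // Existing rule from the snippet above: layers are required unless env_only.
    if !env_only && layers.is_empty() {
        return Err("missing `#[options(layers(...))]` attribute".to_owned());
    }
    Ok(())
}
```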
Four comments on commit 3488bdf, all on code added this session:
1. (Low) env_parsing.rs module doc claimed the helpers "never call std::env::var", which is no longer true after the driver-level helpers were added. Reworded to describe the actual behavior.
2. (Medium) parse_duration_millis_from_env silently dropped a present-but-unparseable value. Now logs a warn! before falling back to the default, matching the macro's fail-soft semantics and the PR's documented behavior.
3. (Medium) parse_from_env had the same silent-drop gap. Now also warns on a present-but-unparseable value.
4. (Medium) The azure_data_cosmos_macros 0.2.0 CHANGELOG listed env_only and parser but omitted the overridable field flag (part of the 0.2.0 bump rationale). Added the missing Features Added entry.
CI clippy (rust 1.95.0, -D warnings) flagged the 5-tuple return type of parse_option_attrs. Factored the parsed attributes into a named ParsedOptionAttrs struct and destructure it at the call site.
…r_hedging_enablement
Address review feedback on #4562: the PPCB incident kill switch is operator-facing and set exclusively via AZURE_COSMOS_PPCB_ENABLED_OVERRIDE, so it should not be part of the public PartitionFailoverOptions configuration surface alongside with_circuit_breaker_enabled. Remove the public builder setter with_circuit_breaker_enabled_override and demote the circuit_breaker_enabled_override getter to pub(crate); keep the private field and env parsing so the env var still works and the authoritative resolution is unchanged. In-crate tests use a #[cfg(test)] pub(crate) setter to exercise the override without mutating process-wide env.
…r_hedging_enablement
Summary
Adds environment-driven enablement controls for two availability features, cross-region read hedging and the per-partition circuit breaker (PPCB), on top of the existing programmatic configuration, plus an incident `_OVERRIDE` kill switch for each.
- Hedging: `AZURE_COSMOS_HEDGING_ENABLED` (env layer of the normal `OperationOptions` layering) and a top-priority kill switch `AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE`. Both are implemented generically in the `CosmosOptions` derive via `#[option(env = "...", overridable)]`.
- PPCB: enablement is a driver-level `PartitionFailoverOptions` concern (since "cosmos: expose CosmosRuntime, consolidate client config" #4588), so its kill switch is a dedicated `PartitionFailoverOptions::circuit_breaker_enabled_override` field, set via `AZURE_COSMOS_PPCB_ENABLED_OVERRIDE`. When set it is authoritative over both the `circuit_breaker_enabled` option and the server account property `enable_per_partition_failover_behavior`.

Hedging and PPCB remain enabled by default; all new variables are inert unless set.
Resolution layering
- Hedging (`hedging_enabled`): `{ENV}_OVERRIDE → operation → account → runtime → {ENV}`.
- PPCB (`circuit_breaker_enabled_override`): override wins over the `circuit_breaker_enabled` option and the account property. (Not part of the per-operation layering; PPCB enablement is driver-level.)

All `_OVERRIDE` values are read once at runtime-build time; flipping mid-incident requires a process restart. Booleans parse leniently (`true`/`false`, `1`/`0`, `yes`/`no`, `on`/`off`); an unrecognized value is logged and ignored.

Changes
`azure_data_cosmos_macros`
- `overridable` field flag (`#[option(env = "...", overridable)]`) → auto-generates `{ENV}_OVERRIDE` parsing + a top-priority `env_override` view layer via `new_with_override`.
- `#[options(env_only)]` struct mode → generates only `from_env()` / `from_env_vars()` (no View/Builder/Default), letting an existing builder type double as its own env source.
- `#[option(env = "...", parser = path)]` attribute → custom `fn(&str) -> Option<T>` parsing (e.g. `Duration` from a millisecond count), with lenient None-is-ignored semantics.
- Version `0.1.0` → `0.2.0`; driver depends on it by path.

`azure_data_cosmos_driver` (core)
- `OperationOptions::hedging_enabled: Option<bool>` (`#[option(env = "AZURE_COSMOS_HEDGING_ENABLED", overridable)]`); a new Priority-0 branch in `resolve_availability_strategy` evaluates it before `availability_strategy`.
- `PartitionFailoverOptions::circuit_breaker_enabled_override` (env `AZURE_COSMOS_PPCB_ENABLED_OVERRIDE`), applied at the two effective-PPCB resolution sites (`PartitionEndpointState::new` and the `LocationStateStore` account-property refresh).
- `RuntimeEnvConfig`, `DiagnosticsEnvConfig`, and `ConnectionPoolEnvConfig` removed; the builders now read env directly (via `env_only` + `parser`) or through the shared `parse_duration_millis_from_env` helper. Malformed env values are warn-and-ignored (fail-soft); bounds violations still hard-error.
Documentation
- `ConfigurationOptions.md`: `_OVERRIDE` table updated with both switches and a note that PPCB's override is driver-level (outside the per-operation layering).
- CHANGELOGs for `azure_data_cosmos`, `azure_data_cosmos_driver`, and `azure_data_cosmos_macros`.

Out of scope
- Hedging eligibility (`should_hedge`) and default threshold are unchanged.