
cosmos: expose CosmosRuntime, consolidate client config#4588

Merged
analogrelay merged 20 commits into
Azure:mainfrom
analogrelay:ashleyst/runtime-at-sdk
Jun 17, 2026
Conversation

@analogrelay

Member

Promotes the driver's CosmosDriverRuntime to a first-class SDK concept as CosmosRuntime, and re-shapes CosmosClientBuilder so that per-process concerns (transport, TLS, proxy, UA defaults) live on the runtime while per-client concerns (operation defaults, fault injection, throughput-control groups, partition-failover tuning) stay on the builder. Along the way, a handful of related config knobs are consolidated into proper nested option groups (PartitionFailoverOptions, ThroughputControlOptions), the partition-failover env vars get a single AZURE_COSMOS_PPCB_* prefix, and the EmulatorServerCertValidation enum is renamed and reshaped to a safer ServerCertificateValidation with an emulator-aware RequiredUnlessEmulator policy.

See the azure_data_cosmos and azure_data_cosmos_driver CHANGELOGs for the full surface-level migration catalog, including every renamed, removed, and relocated public item.

analogrelay and others added 14 commits June 12, 2026 21:17
…efault_operation_options

Renames `CosmosDriverRuntimeBuilder::with_operation_options` to
`with_default_operation_options` and the matching runtime accessors
`CosmosDriverRuntime::operation_options()` →
`default_operation_options()` and `set_operation_options()` →
`set_default_operation_options()`. The new names make the runtime-layer
role explicit and match the option-resolution hierarchy
(per-op → per-driver → runtime → env → built-in default).
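The resolution order in parentheses can be pictured as a first-set-wins chain. The sketch below is illustrative only: `Layers`, `resolve`, and the single `u32` setting are hypothetical names chosen for the example, not the driver's actual types.

```rust
/// Hypothetical stand-in for one setting as seen at each configuration layer.
/// `None` means "not set at this layer".
#[derive(Default)]
struct Layers {
    per_operation: Option<u32>,
    per_driver: Option<u32>,
    runtime: Option<u32>,
    env: Option<u32>,
}

/// Walks the layers from most to least specific and returns the first value
/// that is set, falling back to the built-in default when none is.
fn resolve(layers: &Layers, built_in_default: u32) -> u32 {
    layers
        .per_operation
        .or(layers.per_driver)
        .or(layers.runtime)
        .or(layers.env)
        .unwrap_or(built_in_default)
}
```

With only the runtime and env layers set, `resolve` returns the runtime value; with nothing set it returns the built-in default.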

`DriverOptions{,Builder}::operation_options()` and
`DriverOptionsBuilder::with_operation_options` are unchanged — those
are the per-driver layer and keep their plain names.

Updates all in-driver call sites, the SDK's CosmosClientBuilder forwarding
in cosmos_client_builder.rs, the partition-level failover spec doc, and
in-driver tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a per-driver `user_agent_suffix` override on `DriverOptions`. The
runtime continues to precompute a single `UserAgent` and now stores it
behind `Arc<UserAgent>`. When a driver opts in via
`DriverOptionsBuilder::with_user_agent_suffix`, the driver computes its
own `UserAgent` from the runtime's wrapping-SDK identifier and the
driver-level suffix and stores it in its own `Arc`. When the override
is unset, the driver clones the runtime's `Arc<UserAgent>` — drivers
sharing a runtime without overrides share the same allocation (verified
by a new `Arc::ptr_eq` test).
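A minimal sketch of the sharing behavior described above, using only the standard library. `Runtime`, `Driver`, the `sdk_identifier` field, and the way the suffix is joined onto the identifier are all hypothetical stand-ins; the real `UserAgent` construction is not shown in this PR description.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the precomputed User-Agent value.
#[derive(Debug, PartialEq)]
struct UserAgent(String);

struct Runtime {
    sdk_identifier: String,
    user_agent: Arc<UserAgent>,
}

impl Runtime {
    fn new(sdk_identifier: &str) -> Self {
        Runtime {
            sdk_identifier: sdk_identifier.to_string(),
            user_agent: Arc::new(UserAgent(sdk_identifier.to_string())),
        }
    }
}

struct Driver {
    user_agent: Arc<UserAgent>,
}

impl Driver {
    fn new(runtime: &Runtime, suffix_override: Option<&str>) -> Self {
        let user_agent = match suffix_override {
            // Override set: compute a driver-specific value in its own allocation.
            // The "identifier suffix" layout is an assumption made for this sketch.
            Some(suffix) => Arc::new(UserAgent(format!("{} {}", runtime.sdk_identifier, suffix))),
            // No override: clone the Arc, which only bumps the reference count.
            None => Arc::clone(&runtime.user_agent),
        };
        Driver { user_agent }
    }
}
```

Two drivers built without an override hold pointers to the same allocation, which is the property an `Arc::ptr_eq` check can confirm.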

`CosmosDriver` now stamps requests using its own `user_agent` field
(both on the data-plane hot path and through metadata refresh paths).
The metadata-refresh closure captured by `LocationStateStore` now also
captures the driver's `Arc<UserAgent>` so refresh requests carry the
driver's identity. Bootstrap requests (account-properties probes before
any `CosmosDriver` exists) keep using the runtime's User-Agent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename `CosmosDriverRuntime::get_or_create_driver` to `create_driver`. The
runtime no longer caches drivers per account; each call returns a fresh
`CosmosDriver` wrapped in an `Arc`. Direct consumers of the driver runtime
(currently only `azure_data_cosmos`) are now responsible for any sharing
they want; the SDK pattern remains "one CosmosClient => one CosmosDriver per
build, with the underlying runtime shared via `CosmosClientBuilder::with_runtime`".

Updated all in-driver and SDK call sites, tests, doc examples, README, and
ARCHITECTURE.md (per-account-cache claim replaced with the new model).
Extended the driver test framework so per-operation helpers inherit
`preferred_regions` configured by `run_with_unique_db_and_hedging` (formerly
relied on the pre-warmed cache).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fault injection rules are now configured per-driver via `DriverOptionsBuilder::with_fault_injection_rules`, not on the shared `CosmosDriverRuntime`. Each `CosmosDriver` owns its own (potentially FI-wrapped) HTTP client factory: drivers without rules cheaply clone the runtime's factory `Arc`; drivers with rules wrap it once with `FaultInjectingHttpClientFactory`. This enables true per-driver fault-injection isolation across clients sharing a runtime.

`CosmosDriverRuntime{,Builder}::with_fault_injection_rules` and the runtime-level `fault_injection_enabled` flag are removed. The diagnostic flag is now per-driver. The SDK's `CosmosClientBuilder::with_fault_injection` continues to work; rules flow through `build_driver_options` onto the per-driver options.

Test framework migrated: `DriverTestClient` now stores FI rules and applies them per-operation driver via the new `DriverTestRunContext::driver_options()` helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Throughput-control-group registrations are now configurable at both the
runtime layer (shared defaults across all drivers using the same runtime)
and the driver layer (per-client extensions via DriverOptionsBuilder).

CosmosDriver merges the runtime and driver registries at construction. The
merge is additive: cross-layer (container, name) collisions error before
the driver becomes visible. Within-builder collisions still error at
register-time. Mutable settings (priority level, throughput bucket)
propagate across layers via Arc<RwLock<...>>, so callers holding either
reference observe the same updated state.
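The additive merge and the shared mutable settings described in this commit message might look roughly like the following. This is a sketch under stated assumptions: `GroupSettings`, `Registry`, and `merge` are hypothetical names, and the error is a plain `String` rather than the driver's status type.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical mutable settings shared by every holder of a group.
#[derive(Debug, Default)]
struct GroupSettings {
    throughput_bucket: Option<u32>,
}

/// Registry keyed by (container, group name).
type Registry = HashMap<(String, String), Arc<RwLock<GroupSettings>>>;

/// Additive merge: every entry from both layers is kept, and a key present in
/// both layers is reported as an error instead of being overwritten.
fn merge(runtime: &Registry, driver: &Registry) -> Result<Registry, String> {
    // Cloning the map clones each Arc, so entries still point at the same RwLock.
    let mut merged = runtime.clone();
    for (key, group) in driver {
        if merged.contains_key(key) {
            return Err(format!("duplicate throughput control group {:?}", key));
        }
        merged.insert(key.clone(), Arc::clone(group));
    }
    Ok(merged)
}
```

Because the merged registry holds clones of the same `Arc<RwLock<...>>` values, a write through the source registry is visible through the merged one, and the reverse also holds.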

CosmosDriver::new now returns Result; create_driver propagates the error
with the existing CLIENT_THROUGHPUT_CONTROL_GROUP_REGISTRATION_FAILED
status. All TCG lookups during request processing now route through the
driver's merged registry instead of the runtime's.

SDK CosmosClientBuilder::with_throughput_control_group now flows
registrations into the per-driver options instead of the runtime. SDK
clients sharing a runtime no longer inherit each other's TCG
registrations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce CosmosRuntime and CosmosRuntimeBuilder in azure_data_cosmos as
thin newtypes around the driver crate's CosmosDriverRuntime and
CosmosDriverRuntimeBuilder. The SDK builder exposes only the runtime-shaped
options end users are likely to touch: connection pool, default operation
options, user-agent suffix, CPU refresh interval, and throughput-control
groups. Runtime-only diagnostics options (workload id, correlation id) stay
on the driver builder for advanced consumers.

CosmosRuntime::global() returns a lazily-initialized process-wide default
runtime backed by async_lock::OnceCell. It honors
AZURE_COSMOS_PER_PARTITION_CIRCUIT_BREAKER_ENABLED, applies the SDK's
wrapping-SDK identifier (azsdk-rust-cosmos/<version>), and — when the
allow_invalid_certificates Cargo feature is enabled — defaults emulator
server certificate validation to DangerousDisabled.
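For readers unfamiliar with the lazily-initialized global pattern, here is a dependency-free sketch. It swaps in the synchronous `std::sync::OnceLock` for the `async_lock::OnceCell` named above, and `Runtime` with its single field is a hypothetical stand-in, so treat it as an illustration of the pattern rather than the SDK's code.

```rust
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for a shared runtime.
#[derive(Debug)]
struct Runtime {
    sdk_identifier: String,
}

/// Returns the process-wide default, creating it on first use.
/// Every call after the first returns a clone of the same `Arc`.
fn global() -> Arc<Runtime> {
    static GLOBAL: OnceLock<Arc<Runtime>> = OnceLock::new();
    Arc::clone(GLOBAL.get_or_init(|| {
        Arc::new(Runtime {
            // Placeholder copied from the text above; the real value embeds a version.
            sdk_identifier: "azsdk-rust-cosmos/<version>".to_string(),
        })
    }))
}
```

Callers that never configure a runtime all end up sharing the one instance created on the first `global()` call.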

CosmosRuntimeBuilder::build() always applies the wrapping-SDK identifier,
so even custom runtimes are correctly attributed to this crate. A
doc-hidden CosmosRuntimeBuilder::from_driver_builder escape hatch (gated on
__internal_in_memory_emulator) lets the test harness inject mock HTTP
factories while still getting the SDK's identifier wired up.

Also re-export ConnectionPoolOptions, ConnectionPoolOptionsBuilder, and
EmulatorServerCertValidation from azure_data_cosmos::options so users can
configure custom runtimes without referencing the driver crate directly.

The new types are wired up in lib.rs but are not yet referenced by
CosmosClientBuilder — that integration lands in the next change, when the
SDK builder gains with_runtime() and falls back to global() on build().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewire CosmosClientBuilder around the new CosmosRuntime model. The
builder now resolves a runtime at build() time (using the supplied one or
falling back to CosmosRuntime::global()), then constructs per-driver
DriverOptions carrying everything the client wants to override on top of
the runtime: default OperationOptions, user-agent suffix override,
fault-injection rules, and throughput-control groups.

API changes (breaking):

- Add with_runtime(CosmosRuntime), with_default_operation_options(OperationOptions).
- Rename with_fault_injection -> with_fault_injection_rules (now -> Result<Self>).
- Rename with_throughput_control_group -> register_throughput_control_group
  (now -> Result<Self>).
- Remove with_proxy_allowed, with_allow_emulator_invalid_certificates,
  with_throttling_retry_options, and the doc-hidden with_driver_runtime_builder
  in favor of a custom CosmosRuntime via with_runtime.
- Drop the corresponding fields from CosmosClientOptions: allow_proxy,
  allow_emulator_invalid_certificates, throughput_control_groups buffer,
  fault_injection_rules buffer, driver_runtime_builder cell.

Per-client UA suffix now propagates through DriverOptions::user_agent_suffix
rather than the runtime builder, matching the runtime/driver layered model.

Tests:

- Updated tests/framework/test_client.rs: removed with_allow_emulator_invalid_certificates
  calls (the global runtime handles this via the allow_invalid_certificates
  Cargo feature now), renamed with_fault_injection -> with_fault_injection_rules.
- Updated tests/emulator_tests/cosmos_proxy.rs: build a custom CosmosRuntime
  with a ConnectionPoolOptions configured for proxy + emulator-cert
  relaxation, attach via with_runtime.
- Updated tests/emulator_tests/cosmos_backup_endpoints.rs: dropped the
  per-client cert flag (handled by the global runtime).
- Updated tests/in_memory_emulator_tests/{end_to_end,user_agent}.rs:
  every with_driver_runtime_builder call now routes through
  CosmosRuntimeBuilder::from_driver_builder(...).build().await + with_runtime.
- Replaced one with_throttling_retry_options site with
  with_default_operation_options containing the same throttling settings.

Docs:

- Updated fault_injection/mod.rs doc-comment example and prose to use
  with_fault_injection_rules.
- Updated docs/sdk-to-driver-cutover.md and docs/ConfigurationOptions.md
  to reflect the new builder names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds entries to both CHANGELOGs for the runtime-options refactor:

- azure_data_cosmos_driver: per-driver UA suffix override, per-driver
  fault-injection rules, per-driver throughput-control group registry
  (additive merge), and the breaking renames/removals on
  CosmosDriverRuntime{,Builder} (with_operation_options ->
  with_default_operation_options, removal of the per-account driver
  cache + get_or_create_driver -> create_driver, removal of runtime-
  level fault injection, registry lookup move from runtime to driver,
  CosmosDriver::new now returns Result).

- azure_data_cosmos: new CosmosRuntime / CosmosRuntimeBuilder types
  (with full delegating-setter surface), CosmosRuntime::global(), new
  ConnectionPoolOptions{,Builder} and EmulatorServerCertValidation
  re-exports, and the slim CosmosClientBuilder surface (added
  with_runtime / with_default_operation_options /
  with_fault_injection_rules / register_throughput_control_group;
  removed with_proxy_allowed / with_allow_emulator_invalid_certificates
  / with_throttling_retry_options / with_fault_injection /
  with_throughput_control_group / with_driver_runtime_builder). The
  allow_invalid_certificates Cargo feature is now scoped to
  CosmosRuntime::global()'s default cert-validation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Options

Removes the throughput-control-group registry from CosmosDriverRuntime
entirely — TCGs are now a driver-level concern. The SDK's
CosmosRuntimeBuilder::register_throughput_control_group is removed (use
CosmosClientBuilder::register_throughput_control_group).

Adds a nested ThroughputControlOptions group on
OperationOptions::throughput_control mirroring the ThrottlingRetryOptions
pattern, with three independently layered fields: group_name (replaces
the old top-level OperationOptions::throughput_control_group), and direct
throughput_bucket / priority_level overrides that emit wire headers
without requiring a registered group. Final header resolution per field:
direct value wins, else resolved group_name lookup, else omit. The
implicit "default group for container" fallback on the request path is
removed.
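The per-field rule (direct value wins, else group lookup, else omit) can be sketched as follows. All type names here (`Priority`, `GroupSnapshot`, `ThroughputControl`, `Resolved`) and the `lookup` callback are hypothetical and chosen for the example; they are not the driver's definitions.

```rust
/// Hypothetical priority level.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Priority {
    High,
    Low,
}

/// Hypothetical snapshot of a registered group's settings.
#[derive(Clone, Copy, Default)]
struct GroupSnapshot {
    throughput_bucket: Option<u32>,
    priority_level: Option<Priority>,
}

/// Hypothetical per-operation options: a group reference plus direct overrides.
#[derive(Default)]
struct ThroughputControl<'a> {
    group_name: Option<&'a str>,
    throughput_bucket: Option<u32>,
    priority_level: Option<Priority>,
}

/// Resolved values, small and `Copy` so they can be passed by value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Resolved {
    throughput_bucket: Option<u32>,
    priority_level: Option<Priority>,
}

fn resolve(
    opts: &ThroughputControl,
    lookup: impl Fn(&str) -> Option<GroupSnapshot>,
) -> Resolved {
    // Look the group up once; an unset or unknown name contributes nothing.
    let group = opts
        .group_name
        .and_then(|name| lookup(name))
        .unwrap_or_default();
    Resolved {
        // Per field: direct value first, then the group's value, else None (omit the header).
        throughput_bucket: opts.throughput_bucket.or(group.throughput_bucket),
        priority_level: opts.priority_level.or(group.priority_level),
    }
}
```

Note that the fields resolve independently: an operation can take its bucket from a direct override while still inheriting the priority level from the named group.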

Pipeline contexts now carry a small, Copy ResolvedThroughputControl by
value instead of an Option<&ThroughputControlGroupSnapshot>, removing
lifetime juggling in the attempt / hedge context structs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 21:44
@analogrelay analogrelay requested a review from a team as a code owner June 12, 2026 21:44
@analogrelay analogrelay changed the title feat(cosmos)!: expose CosmosRuntime, consolidate client config cosmos: expose CosmosRuntime, consolidate client config Jun 12, 2026
@github-actions github-actions Bot added the Cosmos The azure_cosmos crate label Jun 12, 2026

Copilot AI left a comment

Contributor


Pull request overview

This PR promotes a shared Cosmos “runtime” concept to the SDK surface (CosmosRuntime) and restructures configuration layering so process-wide concerns (transport / TLS / proxy / UA defaults) live on the runtime while per-client/per-driver concerns (default operation options, fault injection, throughput-control groups, partition-failover tuning) live on the client/driver builder surfaces.

Changes:

  • Adds CosmosRuntime / CosmosRuntimeBuilder to azure_data_cosmos and rewires CosmosClientBuilder to consume a runtime and produce per-client driver options.
  • Refactors azure_data_cosmos_driver to remove per-account driver caching (get_or_create_driver → create_driver) and move fault injection / throughput-control groups / partition-failover tuning to DriverOptions.
  • Consolidates configuration into nested option groups (e.g., OperationOptions::throughput_control, PartitionFailoverOptions) and renames emulator TLS policy to ServerCertificateValidation.

| File | Description |
| --- | --- |
| sdk/cosmos/azure_data_cosmos/tests/in_memory_emulator_tests/user_agent.rs | Updates tests to build SDK clients using CosmosRuntimeBuilder and with_runtime. |
| sdk/cosmos/azure_data_cosmos/tests/in_memory_emulator_tests/end_to_end.rs | Updates emulator E2E tests for runtime-based construction and new operation-options wiring. |
| sdk/cosmos/azure_data_cosmos/tests/in_memory_emulator_tests/dual_backend.rs | Adapts dual-backend test harness to create_driver and new TLS validation policy type. |
| sdk/cosmos/azure_data_cosmos/tests/in_memory_emulator_tests/driver_end_to_end.rs | Updates driver-focused emulator tests for per-driver fault injection and create_driver. |
| sdk/cosmos/azure_data_cosmos/tests/framework/test_client.rs | Updates shared SDK test framework to configure emulator TLS via CosmosRuntime and new FI registration API. |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_proxy.rs | Reworks proxy test to configure proxy + TLS validation via runtime connection pool options. |
| sdk/cosmos/azure_data_cosmos/tests/emulator_tests/cosmos_backup_endpoints.rs | Updates backup-endpoint test (but currently drops explicit emulator TLS runtime wiring). |
| sdk/cosmos/azure_data_cosmos/src/runtime.rs | Introduces the SDK-level CosmosRuntime wrapper over the driver runtime and its builder. |
| sdk/cosmos/azure_data_cosmos/src/options/mod.rs | Re-exports additional driver option types to support runtime/client configuration without direct driver dependency. |
| sdk/cosmos/azure_data_cosmos/src/lib.rs | Exposes CosmosRuntime / CosmosRuntimeBuilder at the crate root. |
| sdk/cosmos/azure_data_cosmos/src/fault_injection/mod.rs | Updates fault-injection module docs to reflect the new per-driver FI registration flow. |
| sdk/cosmos/azure_data_cosmos/src/constants.rs | Removes an SDK env-var constant that’s now consolidated under AZURE_COSMOS_PPCB_*. |
| sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs | Major refactor: runtime attachment, per-client defaults, partition-failover options, per-driver FI & throughput groups. |
| sdk/cosmos/azure_data_cosmos/docs/sdk-to-driver-cutover.md | Updates design notes/docs for the new FI wiring and runtime/client split. |
| sdk/cosmos/azure_data_cosmos/docs/ConfigurationOptions.md | Updates configuration docs to match the new “defaults via operation options/runtime” model. |
| sdk/cosmos/azure_data_cosmos/CHANGELOG.md | Documents new runtime surface and migration notes for builder/config changes. |
| sdk/cosmos/azure_data_cosmos/Cargo.toml | Removes the allow_invalid_certificates feature from the SDK crate metadata/features. |
| sdk/cosmos/azure_data_cosmos_driver/tests/multi_write_tests/driver_partition_failover.rs | Moves PPCB enable/tuning tests to PartitionFailoverOptions (driver-level). |
| sdk/cosmos/azure_data_cosmos_driver/tests/multi_region_tests/driver_partition_failover.rs | Same as above for multi-region scenarios. |
| sdk/cosmos/azure_data_cosmos_driver/tests/multi_region_failover.rs | Updates multi-region failover tests for per-driver FI and create_driver. |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/throttling.rs | Updates in-memory emulator throttling tests for runtime defaults + per-driver FI. |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/hedging.rs | Updates hedging tests for driver-only options (PartitionFailoverOptions, create_driver). |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/excluded_regions_fallback.rs | Switches to create_driver + DriverOptions construction. |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/error_diagnostics.rs | Switches to create_driver + DriverOptions construction. |
| sdk/cosmos/azure_data_cosmos_driver/tests/in_memory_emulator_tests/account_metadata_refresh.rs | Switches to create_driver + DriverOptions construction. |
| sdk/cosmos/azure_data_cosmos_driver/tests/gateway_query_plan_comparison.rs | Switches to create_driver + DriverOptions construction. |
| sdk/cosmos/azure_data_cosmos_driver/tests/framework/test_client.rs | Refactors driver test harness to apply preferred regions/FI/PPCB via per-operation DriverOptions. |
| sdk/cosmos/azure_data_cosmos_driver/tests/emulator_tests/driver_partition_failover.rs | Moves PPCB enable/tuning to PartitionFailoverOptions and new harness entrypoint. |
| sdk/cosmos/azure_data_cosmos_driver/tests/emulator_tests/driver_backup_endpoints.rs | Updates to use create_driver + DriverOptions. |
| sdk/cosmos/azure_data_cosmos_driver/tests/emulator_tests/driver_account_metadata_failover.rs | Updates docs/comments and expectations for create_driver bootstrap path. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/throughput_control.rs | Adds resolved throughput-control struct and registry merge support. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/policies.rs | Renames/reshapes TLS validation policy enum and emulator detection hook. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/partition_failover.rs | Adds new driver-level PartitionFailoverOptions with env-var parsing under AZURE_COSMOS_PPCB_*. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/operation_options.rs | Introduces nested ThroughputControlOptions and removes PPCB knobs from per-operation options. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/mod.rs | Wires new option modules/exports (PartitionFailoverOptions, ThroughputControlOptions, ServerCertificateValidation). |
| sdk/cosmos/azure_data_cosmos_driver/src/options/identity.rs | Adjusts identity docs (minor cleanup). |
| sdk/cosmos/azure_data_cosmos_driver/src/options/env_parsing.rs | Improves validation error field naming across multiple env-var groups. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/driver_options.rs | Expands DriverOptions to carry per-driver UA suffix, FI rules, throughput groups, PPCB tuning. |
| sdk/cosmos/azure_data_cosmos_driver/src/options/connection_pool.rs | Renames emulator TLS toggle to ServerCertificateValidation and updates env-var mapping. |
| sdk/cosmos/azure_data_cosmos_driver/src/models/cosmos_operation.rs | Updates docs/examples to use create_driver(DriverOptions...). |
| sdk/cosmos/azure_data_cosmos_driver/src/in_memory_emulator/client.rs | Updates docs/examples to use create_driver(DriverOptions...). |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/transport/mod.rs | Switches emulator TLS decision logic to new ServerCertificateValidation API. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs | Removes driver cache, renames operation-defaults APIs, changes UA storage to Arc, removes runtime-level FI/groups. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/routing_systems.rs | Replaces internal partition-failover config with PartitionFailoverOptions getters. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/partition_endpoint_state.rs | Removes internal PartitionFailoverConfig and threads PartitionFailoverOptions through routing state. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/routing/location_state_store.rs | Threads new partition-failover options type through store + failback loop. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/pipeline/operation_pipeline.rs | Switches from group snapshot to resolved throughput-control header inputs per request. |
| sdk/cosmos/azure_data_cosmos_driver/README.md | Updates README, but currently includes literal diff markers in a code block. |
| sdk/cosmos/azure_data_cosmos_driver/docs/PARTITION_LEVEL_FAILOVER_SPEC.md | Updates spec snippet for renamed runtime default-operation-options accessor. |
| sdk/cosmos/azure_data_cosmos_driver/CHANGELOG.md | Documents the driver-side option layering restructure and breaking API changes. |
| sdk/cosmos/azure_data_cosmos_driver/ARCHITECTURE.md | Updates architecture docs for create_driver and no driver caching. |
| sdk/cosmos/azure_data_cosmos_benchmarks/src/lib.rs | Updates benchmarks to use create_driver(DriverOptions...). |
| sdk/cosmos/.github/skills/cosmos-pre-commit-validation/SKILL.md | Updates local test command examples to remove the deleted allow_invalid_certificates feature. |

Copilot's findings

  • Files reviewed: 54/54 changed files
  • Comments generated: 5

Comment thread sdk/cosmos/azure_data_cosmos/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs
Comment thread sdk/cosmos/azure_data_cosmos_driver/README.md
Comment thread sdk/cosmos/azure_data_cosmos/CHANGELOG.md
Restructured the SDK and driver CHANGELOG entries for the runtime/options
refactor so the whole branch reads as one Features Added bullet (with
sub-bullets per option group) and one Breaking Changes bullet (with a
catalog of every renamed / removed / relocated public surface) per
crate, matching the "1-2 entries per PR" repo convention.

- azure_data_cosmos (0.35.0):
  - Restored the cross-regional hedging (Azure#4432) and throttling-retry
    (Azure#4544) entries that earlier passes had dropped, updating their
    CosmosClientBuilder API references to use with_default_operation_options
    instead of the removed with_operation_options /
    with_throttling_retry_options setters.
  - Folded the previous 4 Features Added entries (runtime types, re-exports,
    PartitionFailoverOptions, ThroughputControlOptions) into one bullet with
    sub-sections for runtime/client setters, re-exports, and the new nested
    throughput_control group.
  - Folded the slimmed-builder Breaking Changes into one entry that lists
    every removed / renamed setter side-by-side, plus the
    allow_invalid_certificates feature removal with its
    ServerCertificateValidation::RequiredUnlessEmulator migration path.
- azure_data_cosmos_driver (0.4.0):
  - Updated the existing Azure#4432 entry to reference
    PartitionFailoverOptions::consecutive_hedge_win_threshold (the field's
    new location) instead of the now-removed OperationOptions field.
  - Folded the previous 3 Features Added entries (DriverOptionsBuilder
    overrides, PartitionFailoverOptions, ThroughputControlOptions) into a
    single bullet with sub-sections.
  - Folded the seven Breaking Changes entries about the restructure (runtime
    rename, driver cache, runtime FI, runtime TCG registry,
    throughput_control_group removal, OperationOptions PPCB-field removals,
    cert-validation rename, env-var renames) into one Migration impact
    bullet with sub-sections. Retained the pre-existing entries for
    resolve_container_by_rid (Azure#4506) and PartitionKey::EMPTY removal.

Also drops the stale "merges with runtime-layer registry" rustdoc on
CosmosClientBuilder::register_throughput_control_group; the runtime
registry no longer exists, so the doc now describes the driver-only
scope with build()-time duplicate detection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@analogrelay analogrelay force-pushed the ashleyst/runtime-at-sdk branch from 55bbff6 to 52d7c7e Compare June 12, 2026 22:17
analogrelay and others added 2 commits June 15, 2026 18:13
- Remove the `CosmosClientBuilder::with_throttling_retry_options`
  helper that the CHANGELOG already documented as removed. Throttle
  retry configuration goes through `with_default_operation_options` (or
  a `CosmosRuntime` for process-wide defaults).
- Clean up unresolved `-`/`+` diff markers in the driver README's
  Usage example.
- Promote `allow_invalid_certificates` from an env-var-only hack into
  an explicit `TestOptions` opt-in (`TestOptions::for_emulator()`).
  Each emulator-only test in `tests/emulator_tests/` now opts into the
  relaxed runtime explicitly; live and multi-write tests are
  untouched. The existing `AZURE_COSMOS_CONNECTION_STRING=emulator`
  shorthand still auto-relaxes for backward compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@analogrelay analogrelay force-pushed the ashleyst/runtime-at-sdk branch from 52d7c7e to 33191ba Compare June 15, 2026 18:14
@kundadebdatta

Member

🟡 PR Deep Review · 5 inline findings

Reviewed this runtime/config refactor in depth, including the history of #4147 (driver cache), #4252 (TLS feature gating), and #4156 (PPCB env vars). Overall this is a strong, well-executed breaking refactor -- the env rename is complete and consistent, the new CosmosRuntime newtype is clean, and the TLS redesign is safer by construction. I have posted 5 inline 🟡 Recommendations:

  1. PPCB default silently flips ON -> OFF (cosmos_client_builder.rs) -- the unwrap_or(true) "historical default" block was deleted, its regression tests removed, and it is undocumented; compounds with the env-var rename (no back-compat alias).
  2. allow_invalid_certificates promoted to default (policies.rs) -- removes the compile-time cert-bypass backstop that #4252 (Cosmos: Add a native-tls feature to allow configuring native-tls AND still getting relevant config APIs) deliberately kept.
  3. Stale "merges the runtime's registry" docs + test-only extend() (throughput_control.rs).
  4. Deleted security-critical TLS gating tests (transport/mod.rs) -- allows_insecure_connection has no coverage.
  5. with_fault_injection_rules # Errors doc is false (cosmos_client_builder.rs).

Items 1 and 2 are the ones I would most want resolved or explicitly signed off before merge. Separately: the existing Copilot comment that with_throttling_retry_options "is still present" appears to be a false positive -- it is removed from CosmosClientBuilder (the bot likely matched a different type's setter), and the CHANGELOG is accurate.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

Comment thread sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/options/policies.rs
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/options/throughput_control.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/transport/mod.rs
Comment thread sdk/cosmos/azure_data_cosmos/src/clients/cosmos_client_builder.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos/CHANGELOG.md Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/options/partition_failover.rs Outdated
Comment thread sdk/cosmos/azure_data_cosmos_driver/src/driver/runtime.rs
@analogrelay analogrelay force-pushed the ashleyst/runtime-at-sdk branch from 9701a12 to 971fa3b Compare June 16, 2026 18:54

@kundadebdatta kundadebdatta left a comment

Member


LGTM. We can discuss the allow_invalid_certificates separately.

@github-project-automation github-project-automation Bot moved this from Todo to Approved in CosmosDB Rust SDK and Driver Jun 17, 2026
@analogrelay

Member Author

Yep, it's straightforward to add back.

@FabianMeiswinkel FabianMeiswinkel left a comment

Member


LGTM

@analogrelay analogrelay merged commit ccc36a3 into Azure:main Jun 17, 2026
12 checks passed
@github-project-automation github-project-automation Bot moved this from Approved to Done in CosmosDB Rust SDK and Driver Jun 17, 2026
simorenoh added a commit that referenced this pull request Jun 17, 2026
Integrate the Public API Reorganization (#4512) and runtime/options
restructure (#4588) from main with the RID-addressing work on this branch.

Conflict resolutions:
- lib.rs / cosmos_client.rs / container_client.rs: adopt main's reorganized
  module layout and import grouping, re-adding the RID surface
  (resource_identity module, ResourceId/ResourceIdentity, ResourceIdentity
  imports).
- database_client.rs: keep the RID-aware resource_id() short-circuit; route
  throughput methods through it and delegate the missing-_rid error path to
  main's resource_id_or_error helper.
- cosmos_status.rs: drop the duplicate 20306 status; main's generic
  SERVICE_RETURNED_OBJECT_WITHOUT_RID is now canonical.
- cosmos_driver.rs: keep fetch_container_by_rid; adopt main's fallible
  CosmosDriver::new signature.
- CHANGELOGs: combine both crates' Unreleased entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
kundadebdatta pushed a commit that referenced this pull request Jun 17, 2026
The two from_options_env_override_* tests referenced OperationOptions::per_partition_circuit_breaker_enabled and PartitionFailoverConfig::from_options, both removed by #4588 when PPCB enablement moved to driver-level PartitionFailoverOptions. They were dead code carried in by the main merge and broke 'cargo clippy --all-targets' (lib test) compilation.
kundadebdatta pushed a commit that referenced this pull request Jun 17, 2026
…view)

Addresses analogrelay review comment on RuntimeEnvConfig: 'Might want to align this with changes coming in #4588.'

Removes the bespoke RuntimeEnvConfig CosmosOptions macro struct and routes AZURE_COSMOS_CPU_REFRESH_INTERVAL_MS through the shared parse_duration_millis_from_env helper, matching how #4588's driver-level PartitionFailoverOptions builder reads its duration env vars. All duration env vars now resolve on one path. Drops the now-obsolete RuntimeEnvConfig field-mapping unit test (the helper is covered by env_parsing tests) and updates the driver CHANGELOG to reflect the actual consolidation scope.
kundadebdatta pushed a commit that referenced this pull request Jun 17, 2026
…Options (PR #4562)

The #4588 merge moved PPCB enablement off OperationOptions onto driver-level PartitionFailoverOptions, which dropped this PR's PPCB _OVERRIDE kill switch while the changelogs/docs still advertised it. This restores the feature on its new home.

Adds PartitionFailoverOptions::circuit_breaker_enabled_override (env AZURE_COSMOS_PPCB_ENABLED_OVERRIDE, lenient boolean via new parse_optional_bool_from_env helper) plus the matching builder setter. The override is authoritative over BOTH the circuit_breaker_enabled option and the account property enable_per_partition_failover_behavior, applied at the two effective-PPCB resolution sites (PartitionEndpointState::new and LocationStateStore account-property refresh). Updates both CHANGELOGs and ConfigurationOptions.md to the correct env-var name/scope, and adds unit tests covering both override directions at the options and routing-state layers.
kundadebdatta pushed a commit that referenced this pull request Jun 18, 2026
…nges

The merge-preview against origin/main surfaced pre-existing breakage: #4588 renamed CosmosDriverRuntime::get_or_create_driver to create_driver (now taking only DriverOptions, account embedded), made PartitionFailoverOptions fields private, and removed the per-operation PPCB knobs from OperationOptions. Several inherited driver tests still used the old APIs and only compiled behind feature gates, so CI's --all-features --all-targets clippy fails.

Fixes: migrate the three create_driver call sites to the new single-arg signature; migrate the PPCB override tests to configure thresholds via driver-level PartitionFailoverOptions (new build_ppcb_fixture with a 1-failure threshold; build_fixture delegates with None so non-PPCB tests are unchanged); cast write_failure_threshold() (u32) to i32 for the write_failure_count field; add a test-only PartitionFailoverConfig alias to keep inherited operation_pipeline tests compiling.
kundadebdatta pushed a commit that referenced this pull request Jun 18, 2026
…ture

Addresses @analogrelay's review on #4623. The dataflow pipeline fans a query into per-physical-partition sub-operations and stamps the owning partition_key_range_id (plus narrowed feed range / partition key) onto OperationOverrides rather than mutating the shared CosmosOperation. pre_resolve_partition_key_range_id previously inspected only the CosmosOperation, so for EPK-range query sub-ops it re-resolved overlapping ranges from scratch and could collapse to None on a multi-range match -- silently dropping the PPCB/PPAF seed.

Fix: pass OperationOverrides into pre_resolve_partition_key_range_id and consult it first -- (1) use overrides.partition_key_range_id directly when present (no cache lookup, no multi-range collapse), (2) else resolve a logical partition key from overrides/operation, (3) else resolve an EPK range from overrides/operation, seeding only on a single owning partition.
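
The lookup order described in the fix can be sketched as below. This is a minimal illustration, not the driver's code: `Overrides`, `Operation`, `RangeCache`, and `pre_resolve_sketch` are hypothetical stand-ins, a logical partition key is represented by an already-hashed `u32` EPK value, and EPK ranges are plain half-open `u32` intervals.

```rust
/// Hypothetical, simplified stand-in for the per-sub-operation overrides.
#[derive(Default)]
struct Overrides {
    partition_key_range_id: Option<String>,
    partition_key_epk: Option<u32>,
    epk_range: Option<(u32, u32)>,
}

/// Hypothetical, simplified stand-in for the shared operation.
#[derive(Default)]
struct Operation {
    partition_key_epk: Option<u32>,
    epk_range: Option<(u32, u32)>,
}

/// Toy range cache: (range id, min inclusive, max exclusive).
struct RangeCache(Vec<(&'static str, u32, u32)>);

impl RangeCache {
    /// The range that contains a single EPK value, if any.
    fn owner_of_point(&self, epk: u32) -> Option<String> {
        self.0
            .iter()
            .find(|(_, min, max)| *min <= epk && epk < *max)
            .map(|(id, _, _)| id.to_string())
    }

    /// The owning range only when exactly one range overlaps [lo, hi).
    fn single_owner_of_range(&self, (lo, hi): (u32, u32)) -> Option<String> {
        let mut hits = self.0.iter().filter(|(_, min, max)| *min < hi && lo < *max);
        match (hits.next(), hits.next()) {
            (Some((id, _, _)), None) => Some(id.to_string()),
            _ => None, // no match, or a multi-range match: do not seed
        }
    }
}

fn pre_resolve_sketch(op: &Operation, ov: &Overrides, cache: &RangeCache) -> Option<String> {
    // (1) An id stamped on the overrides is used directly, with no cache lookup.
    if let Some(id) = &ov.partition_key_range_id {
        return Some(id.clone());
    }
    // (2) Otherwise resolve a logical partition key, overrides first.
    if let Some(epk) = ov.partition_key_epk.or(op.partition_key_epk) {
        return cache.owner_of_point(epk);
    }
    // (3) Otherwise resolve an EPK range, seeding only on a single owner.
    let range = ov.epk_range.or(op.epk_range)?;
    cache.single_owner_of_range(range)
}
```

With two ranges covering `[0, 50)` and `[50, 100)`, an operation whose EPK range is `(0, 100)` resolves to `None` on its own, but resolves to the stamped id as soon as `partition_key_range_id` is present on the overrides.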

Also fixes the W2 in-memory-emulator failure: the non-PPCB region-failover fixture (build_fixture) was left on the driver-default circuit_breaker_enabled=true after the #4588 PPCB-config migration, so pre-resolution issued an unfaulted pkranges read to the primary that polluted the recorded host list. build_fixture now explicitly disables PPCB; build_ppcb_fixture keeps it enabled with a 1-failure threshold.
kundadebdatta pushed a commit that referenced this pull request Jun 18, 2026
PRs #4588 and #4590 had "semantic" merge conflicts. They didn't actually
conflict with each other at the diff level, but new tests added by #4590
used APIs that #4588 renamed or moved. Because we don't enforce
up-to-date branches before merge, they both landed in main and broke it.

This PR resolves that conflict and fixes the tests from #4590 to use the
correct APIs.
kundadebdatta added a commit that referenced this pull request Jun 19, 2026
…PPCB, plus `AZURE_COSMOS_HEDGING_ENABLED` master switch (#4562)

## Summary

Adds environment-driven enablement controls for two availability features,
cross-region read hedging and the per-partition circuit breaker (PPCB), on
top of the existing programmatic configuration, plus an incident `_OVERRIDE`
kill switch for each.

- **Hedging** gains a master switch `AZURE_COSMOS_HEDGING_ENABLED` (env layer
  of the normal `OperationOptions` layering) and a top-priority kill switch
  `AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE`. Both are implemented generically
  in the `CosmosOptions` derive via `#[option(env = "...", overridable)]`.
- **PPCB** enablement is a driver-level `PartitionFailoverOptions` concern
  (since #4588), so its kill switch is a dedicated
  `PartitionFailoverOptions::circuit_breaker_enabled_override` field, set via
  `AZURE_COSMOS_PPCB_ENABLED_OVERRIDE`. When set it is authoritative over
  **both** the `circuit_breaker_enabled` option and the server account
  property `enable_per_partition_failover_behavior`.

Hedging and PPCB remain enabled by default; all new variables are inert
unless set.

## Resolution layering

- **Hedging** (`hedging_enabled`): `{ENV}_OVERRIDE → operation → account →
  runtime → {ENV}`.
- **PPCB** (`circuit_breaker_enabled_override`): override wins over the
  `circuit_breaker_enabled` option **and** the account property. (Not part of
  the per-operation layering; PPCB enablement is driver-level.)
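
A minimal sketch of the two rules above, assuming each layer contributes an `Option<bool>` and the first layer that is set wins. `HedgingLayers`, `resolve_hedging_enabled`, and `resolve_ppcb_enabled` are hypothetical names for illustration. How the driver combines the `circuit_breaker_enabled` option with the account property when no override is set is not described here, so the sketch takes that result as a precomputed `baseline`.

```rust
/// One boolean setting as seen by each layer; `None` means "not set here".
#[derive(Default, Clone, Copy)]
struct HedgingLayers {
    env_override: Option<bool>, // AZURE_COSMOS_HEDGING_ENABLED_OVERRIDE
    operation: Option<bool>,
    account: Option<bool>,
    runtime: Option<bool>,
    env: Option<bool>, // AZURE_COSMOS_HEDGING_ENABLED
}

/// Walks the layers in priority order and returns the first value that is set.
/// `None` means no layer expressed an opinion, so the caller falls through to
/// its normal availability-strategy resolution.
fn resolve_hedging_enabled(layers: HedgingLayers) -> Option<bool> {
    layers
        .env_override
        .or(layers.operation)
        .or(layers.account)
        .or(layers.runtime)
        .or(layers.env)
}

/// PPCB: the override, when set, replaces whatever the option and the account
/// property would otherwise have produced (`baseline`, an assumption here).
fn resolve_ppcb_enabled(override_value: Option<bool>, baseline: bool) -> bool {
    override_value.unwrap_or(baseline)
}
```

Under these assumptions, setting only `env_override: Some(false)` disables hedging even when the operation layer asks for `Some(true)`, which is the kill-switch behavior the override is meant to provide.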

All `_OVERRIDE` values are read once at runtime-build time; flipping
mid-incident requires a process restart. Booleans parse leniently
(`true/false`, `1/0`, `yes/no`, `on/off`); an unrecognized value is logged
and ignored.
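
As a rough illustration of the lenient parsing described above: `parse_lenient_bool` and `read_bool_env` are hypothetical helpers, not the driver's `parse_optional_bool_from_env`, and the trimming and case-insensitivity are assumptions of this sketch.

```rust
/// Accepts the spellings listed above; anything else is `None`.
/// Trimming and case-insensitive matching are assumptions of this sketch.
fn parse_lenient_bool(raw: &str) -> Option<bool> {
    match raw.trim().to_ascii_lowercase().as_str() {
        "true" | "1" | "yes" | "on" => Some(true),
        "false" | "0" | "no" | "off" => Some(false),
        _ => None,
    }
}

/// Reads a boolean env var once; an unrecognized value is logged and ignored,
/// which leaves the setting unset rather than failing the build of the runtime.
fn read_bool_env(name: &str) -> Option<bool> {
    let raw = std::env::var(name).ok()?;
    let parsed = parse_lenient_bool(&raw);
    if parsed.is_none() {
        eprintln!("ignoring unrecognized boolean {raw:?} for {name}");
    }
    parsed
}
```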

## Changes

### `azure_data_cosmos_macros`
- New `overridable` field flag (`#[option(env = "...", overridable)]`) →
  auto-generates `{ENV}_OVERRIDE` parsing + a top-priority `env_override`
  view layer via `new_with_override`.
- New `#[options(env_only)]` struct mode → generates only `from_env()`/
  `from_env_vars()` (no View/Builder/Default), letting an existing builder
  type double as its own env source.
- New `#[option(env = "...", parser = path)]` attribute → custom
  `fn(&str) -> Option<T>` parsing (e.g. `Duration` from a millisecond count),
  with lenient None-is-ignored semantics.
- Crate version `0.1.0` → `0.2.0`; driver depends on it by `path`.
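
For the `parser = path` attribute, a function of the required `fn(&str) -> Option<T>` shape might look like the sketch below. `parse_duration_millis`, `read_with_parser`, and the env var name in the comment are hypothetical; the code the derive actually generates is not shown in this PR description.

```rust
use std::time::Duration;

/// A parser of the `fn(&str) -> Option<T>` shape: a millisecond count becomes
/// a `Duration`; anything that is not a non-negative integer yields `None`.
fn parse_duration_millis(raw: &str) -> Option<Duration> {
    raw.trim().parse::<u64>().ok().map(Duration::from_millis)
}

// Illustrative attribute usage (hypothetical field and variable name):
//
//     #[option(env = "EXAMPLE_TIMEOUT_MS", parser = parse_duration_millis)]
//     timeout: Option<Duration>,

/// Stand-in for what generated code could do with such a parser: read the
/// variable and treat a `None` from the parser as "not set".
fn read_with_parser<T>(name: &str, parser: fn(&str) -> Option<T>) -> Option<T> {
    std::env::var(name).ok().and_then(|raw| parser(&raw))
}
```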

### `azure_data_cosmos_driver` (core)
- `OperationOptions::hedging_enabled: Option<bool>`
  (`#[option(env = "AZURE_COSMOS_HEDGING_ENABLED", overridable)]`); a new
  Priority-0 branch in `resolve_availability_strategy` evaluates it before
  `availability_strategy`.
- `PartitionFailoverOptions::circuit_breaker_enabled_override` (env
  `AZURE_COSMOS_PPCB_ENABLED_OVERRIDE`), applied at the two effective-PPCB
  resolution sites (`PartitionEndpointState::new` and the
  `LocationStateStore` account-property refresh).
- Options-layering cleanup (per review): `RuntimeEnvConfig`,
  `DiagnosticsEnvConfig`, and `ConnectionPoolEnvConfig` removed; the builders
  now read env directly (via `env_only` + `parser`) or through the shared
  `parse_duration_millis_from_env` helper. Malformed env values are
  warn-and-ignored (fail-soft); bounds violations still hard-error.
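
The fail-soft versus hard-error split in the last bullet can be illustrated roughly as follows. `OutOfBounds` and `resolve_interval` are hypothetical, as are the idea of explicit `min`/`max` parameters and applying the bounds check to the final value; the sketch only shows a malformed value falling back to the default while a well-formed but out-of-range value is rejected.

```rust
use std::time::Duration;

/// Hypothetical error for a value that parsed correctly but is out of range.
#[derive(Debug, PartialEq)]
struct OutOfBounds {
    name: String,
    value: Duration,
}

/// `raw` is the env value if the variable is set. Malformed input is warned
/// about and ignored (the default is used); a bounds violation is an error.
fn resolve_interval(
    name: &str,
    raw: Option<&str>,
    default: Duration,
    min: Duration,
    max: Duration,
) -> Result<Duration, OutOfBounds> {
    let parsed = raw.and_then(|s| match s.trim().parse::<u64>() {
        Ok(ms) => Some(Duration::from_millis(ms)),
        Err(_) => {
            eprintln!("ignoring malformed value {s:?} for {name}");
            None
        }
    });
    let value = parsed.unwrap_or(default);
    if value < min || value > max {
        return Err(OutOfBounds { name: name.to_string(), value });
    }
    Ok(value)
}
```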

### Documentation
- `ConfigurationOptions.md` `_OVERRIDE` table updated with both switches and
  a note that PPCB's override is driver-level (outside the per-operation
  layering).
- CHANGELOG entries in `azure_data_cosmos`, `azure_data_cosmos_driver`, and
  `azure_data_cosmos_macros`.

## Out of scope
- PPAF (server-driven) enablement is unchanged.
- The multi-region eligibility gate (`should_hedge`) and default threshold
  are unchanged.

---------

Co-authored-by: kundadebdatta <kunda.debdatta@microsoft.com>

Labels

Cosmos The azure_cosmos crate

Projects

Status: Done
