
#4666 Phase C — validate + stress subcommands #4676

Merged
jeremydmiller merged 1 commit into master from feature/4666-phase-c-validate-metrics-stress
Jun 5, 2026


@jeremydmiller

Member

Phase C of #4666. Stacked
on top of #4675 (Phase B).
Adds the validate / stress subcommands and a determinism fix to the
TenantBucketRollup projection that the new validate command flushed out on
first end-to-end run.

What ships

| Bit | Notes |
| --- | --- |
| Aggregate baseline + diff (`Validation/AggregateBaseline.cs`) | Walks every `mt_doc_*` table under the configured schema (skipping `_b_N` hash-partition children; the parent's rows are hashed once). Streaming SHA-256 over `id \|\| '\0' \|\| data::text \|\| '\n'` ordered by `id::text`. Stable hash for byte-identical aggregates. |
| `validate` subcommand | `marten-scaletest validate --baseline scaletest-baseline.json [--write-baseline]`. First run writes the baseline and exits 0. Subsequent runs diff against it and exit 1 on drift. |
| `stress` subcommand (the actual #4667 crash gate) | `marten-scaletest stress [--wipe] --tenants N --events-per-tenant M --writers W --shard-timeout-seconds S --baseline scaletest-baseline.json [--skip-validate]`. Chains seed → rebuild → validate with fail-fast semantics. Spectre table summarises every phase. |
| `TenantDailyRollupProjection` → `TenantBucketRollup` determinism fix | Original date-based key fell back to `DateTimeOffset.UtcNow` (every Updated snapshot lacked a Requested timestamp), so rebuilds drifted by minute. New key is a stable hash bucket from the `AppointmentDetails` id (64 buckets). `ByRoutingReason` is now a `SortedDictionary` for stable JSON serialization. Projection `.Name` preserved as "TenantDailyRollup" per the #4666 spec. |
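For readers who want to see the shape of the per-table hash from the first row above, here is a minimal sketch in Python. It is not the code in `Validation/AggregateBaseline.cs`. The function names are hypothetical, rows are passed in as already-fetched `(id_text, data_text)` pairs rather than streamed from Postgres, and the `_b_<digits>` suffix pattern is an assumption about what "_b_N" means. Python's `sorted` orders by code point, which may differ from how Postgres collates `id::text`.

```python
import hashlib
import re

# Assumption: hash-partition children end in "_b_" followed by digits.
_PARTITION_CHILD = re.compile(r"_b_\d+$")


def is_partition_child(table_name: str) -> bool:
    """True for names like mt_doc_foo_b_3, which the walk would skip."""
    return _PARTITION_CHILD.search(table_name) is not None


def hash_table(rows) -> str:
    """Hash (id_text, data_text) pairs as id, NUL, data, newline, ordered by id text."""
    digest = hashlib.sha256()
    for row_id, data in sorted(rows, key=lambda row: row[0]):
        digest.update(row_id.encode("utf-8"))
        digest.update(b"\0")  # separator between id and document body
        digest.update(data.encode("utf-8"))
        digest.update(b"\n")  # row terminator
    return digest.hexdigest()
```

Because rows are fed to the digest in id order, two tables with byte-identical documents hash the same regardless of the order in which the rows were written.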

Smoke (local Postgres)

  • 2 tenants × 500 events seed = 1,141 events / 150 streams
  • stress --wipe → seed OK + rebuild OK + validate writes baseline (13 tables)
  • stress (no --wipe, same events) → seed SKIPPED + rebuild OK + validate runs
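The phase chaining that these smoke runs exercise (seed, then rebuild, then validate, stopping at the first failure) can be sketched roughly as below. This is an illustrative Python outline, not the `stress` implementation. `run_chain`, the status strings, and the rule that an exception counts as a failure are all assumptions made for the example.

```python
def run_chain(phases):
    """Run (name, callable) phases in order, stopping at the first failure.

    Each callable returns "OK", "SKIPPED", or "FAILED" (hypothetical statuses).
    Returns the per-phase results and an exit code that is 1 if any phase failed.
    """
    results = []
    for name, run in phases:
        try:
            status = run()
        except Exception:
            # Assumption: an unhandled error in a phase counts as FAILED.
            status = "FAILED"
        results.append((name, status))
        if status == "FAILED":
            break  # fail-fast: later phases are not attempted
    exit_code = 1 if any(status == "FAILED" for _, status in results) else 0
    return results, exit_code
```

Under these assumptions, the second smoke run corresponds to `run_chain` receiving a seed phase that returns "SKIPPED" and still exiting 0, because a skipped phase is not treated as a failure.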

Known limitation surfaced by Phase C smoke

The validate phase reports two tables drifting across rebuild runs of the same events:

  • mt_doc_appointmentmetrics (custom IProjection aggregating across streams via LoadAsync + Store)
  • mt_doc_tenantbucketrollup (multi-stream projection summing AppointmentDetails events into hash buckets)

Both legitimately produce non-deterministic per-tenant counts under parallel slice fan-out: the events arrive in different orders across runs, so intermediate aggregation values differ. The other 11 tables are byte-identical across rebuilds.

For the #4667 verification gate this is the right behavior: the validate phase catches projection-side non-determinism (useful), and the rebuild phase staying OK across multiple runs is the actual race-fix signal. Phase D (running stress at full 20M-event scale) uses the same rebuild-OK gate.
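To make the drift report concrete, here is a minimal Python sketch of comparing a stored baseline against freshly captured hashes. It assumes the baseline is a simple mapping of table name to hex digest. The real JSON layout and the signature of `AggregateBaselineCapture.Diff` are not shown in this PR, so `diff_baselines` and the message wording are hypothetical.

```python
def diff_baselines(expected: dict, actual: dict) -> list:
    """Return one drift line per table whose hash differs or is missing."""
    lines = []
    for table in sorted(set(expected) | set(actual)):
        before = expected.get(table)
        after = actual.get(table)
        if before is None:
            lines.append(f"{table}: not in baseline")
        elif after is None:
            lines.append(f"{table}: missing from current state")
        elif before != after:
            lines.append(f"{table}: hash changed")
    return lines


def exit_code_for(drift_lines: list) -> int:
    """Exit 0 when nothing drifted, 1 otherwise."""
    return 1 if drift_lines else 0
```

With the smoke results above, a comparison like this would list the two drifting tables and leave the remaining 11 unreported.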

Followups (after merge)

  • Phase D: actually run stress at scale against the dev box. If rebuild stays OK across the run, close #4667.

🤖 Generated with Claude Code

Base automatically changed from feature/4666-phase-b-composite-rebuild to master June 5, 2026 21:12
…erminism fix

What ships:

* Aggregate baseline + diff (Validation/AggregateBaseline.cs):
    AggregateBaselineCapture.CaptureAsync(connection, schemaName, ct)
    AggregateBaselineCapture.Diff(expected, actual)
    AggregateBaselineCapture.WriteAsync / ReadAsync (JSON)

  Walks every mt_doc_* table under the configured schema (skipping the
  _b_N hash-partition children — we hash the parent's rows once). For each
  table: streaming SHA-256 over `id || '\0' || data::text || '\n'` ordered
  by id::text. Stable hash for byte-identical aggregates.

* `validate` subcommand:
    marten-scaletest validate --baseline scaletest-baseline.json
                              [--write-baseline]
  - First run with no baseline writes one and exits 0.
  - Subsequent runs diff captured state against the baseline; non-empty
    diff exits 1 with per-table drift lines.
  - --write-baseline overrides + writes the current state as the new
    baseline (use after intentional projection changes).

* `stress` subcommand (the actual #4667 crash gate):
    marten-scaletest stress [--wipe]
                            --tenants N --events-per-tenant M --writers W
                            --shard-timeout-seconds S
                            --baseline scaletest-baseline.json
                            [--skip-validate]
  Chains seed → rebuild → validate with fail-fast semantics. Spectre table
  summarises every phase with status / elapsed / one-line note. Final exit
  code reflects the worst phase.

* TenantDailyRollupProjection determinism fix:
  The original date-based key fell back to DateTimeOffset.UtcNow when the
  upstream AppointmentDetails snapshot didn't carry a Requested timestamp
  (which was every snapshot, since the lifted Evolve never set it). The
  fallback made the projection produce different bucket keys on every
  rebuild — caught immediately by the new validate subcommand on first
  end-to-end run.

  Fix: rename to TenantBucketRollup, key by a stable hash bucket derived
  from the AppointmentDetails id (first 4 bytes mod 64 → "b000".."b063").
  Same id → same bucket → reproducible across rebuilds. Preserves the
  "exercises cross-stage chaining + per-tenant aggregation" intent without
  the date dimension. ByRoutingReason swapped to SortedDictionary so JSON
  serialization order is stable.

  Projection .Name preserved as "TenantDailyRollup" per the #4666 spec.
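The bucket derivation in the determinism fix (first 4 bytes mod 64, formatted as "b000".."b063") and the sorted-key serialization can be illustrated with the short Python sketch below. It is an approximation of the idea, not the projection's C# code. `bucket_key` and `stable_json` are hypothetical names, and the sketch reads the first four bytes as a big-endian integer, whereas .NET's `Guid.ToByteArray` stores the first field little-endian, so the actual bucket assigned to a given id may differ.

```python
import json
import uuid

BUCKET_COUNT = 64


def bucket_key(appointment_id: uuid.UUID) -> str:
    """Map an id to a stable bucket name between b000 and b063."""
    first_four = appointment_id.bytes[:4]
    # Assumption: big-endian interpretation; the real byte order is not stated.
    bucket = int.from_bytes(first_four, "big") % BUCKET_COUNT
    return f"b{bucket:03d}"


def stable_json(by_routing_reason: dict) -> str:
    """Serialize with sorted keys so insertion order cannot change the bytes."""
    return json.dumps(by_routing_reason, sort_keys=True)
```

The key property is that the bucket depends only on the id, never on the clock, so replaying the same events yields the same keys. Sorting the dictionary keys plays the same role for the serialized document body that `SortedDictionary` plays in the fix.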

Smoke (local Postgres):
* 2 tenants × 500 events seed = 1,141 events / 150 streams.
* stress --wipe → seed OK + rebuild OK + validate writes baseline (13
  tables hashed).
* stress (no --wipe, same events) → seed SKIPPED + rebuild OK (re-runs
  cleanly) + validate.

Known limitation surfaced by Phase C smoke:
The validate phase reports two tables drifting across rebuild runs of
the SAME events:
  - mt_doc_appointmentmetrics (custom IProjection aggregating across
    streams via LoadAsync + Store)
  - mt_doc_tenantbucketrollup (multi-stream projection summing
    AppointmentDetails events into hash buckets)
Both legitimately produce non-deterministic per-tenant counts under
parallel slice fan-out: the events arrive in different orders across
runs, so intermediate aggregation values differ. The other 11 tables are
byte-identical across rebuilds.

For the #4667 verification gate this is the right behavior: the validate
catches projection-side non-determinism (useful), and the `rebuild`
phase staying OK across multiple runs is the actual race-fix signal.
Phase D (running stress at full 20M-event scale) will use the same
rebuild-OK gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller force-pushed the feature/4666-phase-c-validate-metrics-stress branch from 7e17168 to 3680d02 Compare June 5, 2026 22:27
@jeremydmiller jeremydmiller merged commit d69f1b7 into master Jun 5, 2026
8 checks passed
@jeremydmiller jeremydmiller deleted the feature/4666-phase-c-validate-metrics-stress branch June 5, 2026 22:52
Development

Successfully merging this pull request may close these issues.

Async daemon: eliminate session-shared dictionary access from projection work (VersionTracker, ItemMap, ChangeTrackers)
