Skip to content

#4684 Phase E.2 — progression lock-wait sampler#4692

Merged
jeremydmiller merged 2 commits into
masterfrom
feature/4684-phase-e2-lockwait
Jun 8, 2026
Merged

#4684 Phase E.2 — progression lock-wait sampler#4692
jeremydmiller merged 2 commits into
masterfrom
feature/4684-phase-e2-lockwait

Conversation

@jeremydmiller

Copy link
Copy Markdown
Member

Second slice of #4684 (item 5 of the parent spec). Marten-side only, no core hooks needed.

Why

PostgreSQL doesn't expose cumulative per-row lock-wait history -- pg_stat_activity shows the current wait of each session. The sampler approximates by polling at the existing --instrument-sample-seconds cadence and counting ungranted lock holders on mt_event_progression that aren't our own session:

WHERE a.state = 'active'
  AND a.wait_event_type IS NOT NULL
  AND a.pid != pg_backend_pid()
  AND EXISTS (
      SELECT 1 FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
      WHERE l.pid = a.pid
        AND c.relname = 'mt_event_progression'
        AND l.granted = false)

Per-sample (timestamp, waiter_count, max_wait_ms) rolls up at end-of-run to three numbers:

Metric Meaning
MaxConcurrentWaiters Peak count of ungranted lock holders observed in any single sample
MaxSingleWaitMs Longest wait observed in any single sample
ObservedWaiterSeconds sum(waiter_count_per_sample) * sample_interval -- approximation of cumulative time spent waiting

Zero across all three signals no contention; non-zero signals the new concurrent-rebuild cap (jasperfx#420 in 2.9.0) is being exercised, or that something outside the daemon is racing the row.

What ships

File Role
Instrumentation/ProgressionLockSampler.cs Background poll loop. Separate connection per tick (cheap, dev-tool only). Filters our own pid out so the sampler never observes itself. Best-effort: a transient blip drops a sample.
Instrumentation/LockWaitStats (record in same file) Rolled-up summary that lands in RebuildInstrumentation.Snapshot.
InstrumentationOptions.LockTracePath + --instrument-lock-trace <path> Per-sample CSV trace, implies --instrument like its sibling flag.
RebuildInstrumentation Owns the new sampler alongside ProgressSampler / NpgsqlCommandCounter; one Activity span tags the rolled-up contention numbers.
RebuildCommand / StressCommand Console output picks up three new rows; metrics.json grows a \"progressionLockWaits\" section.
src/Marten.ScaleTesting/README.md Documents the metric, the new flag, and the interpretation.

Verification

Smoke at 22.5k events (4 tenants × 5k × 4 writers):

Aspect Result
Lock trace CSV 6 rows at 0.5s interval, all (0, 0) -- no contention on a single rebuild against a fresh DB (expected).
Console row lock max-waiters 0, waiter-sec 0.0 alongside the existing E.1 throughput / pg-cmd rows.
--instrument off True no-op -- sampler / counter / activity never constructed.

Real non-zero readings need parallel concurrent rebuilds racing the same row; the harness is wired to report them when the situation arises rather than manufacture contention in a smoke test.

Remaining Phase E follow-ups

  • E.3 -- RecentlyUsedCache hit/miss counters. Requires a JasperFx-side instrumentation hook on RecentlyUsedCache<TId, TDoc>.
  • E.4 -- Per-EvolveAsync lookup count. Requires wrapping IDocumentOperations query/load calls inside projection user-code; likely a JasperFx-side hook on SliceGroup.

Parent #4684 stays open until E.3-E.4 land.

🤖 Generated with Claude Code

jeremydmiller and others added 2 commits June 8, 2026 14:00
Second slice of #4684 (item 5 of the parent spec). Marten-side only, no core
hooks needed.

PostgreSQL doesn't expose cumulative per-row lock-wait history, so the sampler
approximates by polling pg_stat_activity at the existing
--instrument-sample-seconds cadence and counting ungranted lock holders on
mt_event_progression that aren't our own session:

  WHERE a.state = 'active'
    AND a.wait_event_type IS NOT NULL
    AND a.pid != pg_backend_pid()
    AND EXISTS (
        SELECT 1 FROM pg_locks l
        JOIN pg_class c ON c.oid = l.relation
        WHERE l.pid = a.pid
          AND c.relname = 'mt_event_progression'
          AND l.granted = false)

Per-sample (timestamp, waiter_count, max_wait_ms) rolls up at end-of-run to
three numbers: max_concurrent_waiters, max_single_wait_ms, and the approximation
observed_waiter_seconds = sum(waiter_count) * sample_interval. Zero across all
three signals no contention; non-zero signals the new concurrent-rebuild cap
(#420 in 2.9.0) is being exercised or that something outside the daemon
is racing the row.

## What ships

* `Instrumentation/ProgressionLockSampler.cs` — background poll loop, separate
  connection per tick (cheap, dev-tool only). Filters our own pid out so the
  sampler never observes itself. Best-effort: transient blip drops a sample.
* `Instrumentation/LockWaitStats` record — the rolled-up summary that lands in
  RebuildInstrumentation.Snapshot.
* `InstrumentationOptions.LockTracePath` + `--instrument-lock-trace <path>` —
  per-sample CSV trace, implies --instrument like its sibling.
* RebuildInstrumentation owns the new sampler alongside the existing
  ProgressSampler / NpgsqlCommandCounter; one Activity span tags the rolled-up
  contention numbers.
* RebuildCommand / StressCommand console output picks up three new rows;
  metrics.json grows a "progressionLockWaits" section.
* README documents the metric and the interpretation.

## Verification

Smoke at 22.5k events (4 tenants × 5k × 4 writers):
* Lock trace CSV: 6 rows at 0.5s interval, all (0, 0) — no contention on a
  single rebuild against a fresh DB (expected).
* Console: "lock max-waiters 0, waiter-sec 0.0" alongside the existing E.1
  throughput / pg-cmd rows.
* --instrument off still a true no-op — sampler / counter / activity never
  constructed.

Real non-zero readings need parallel concurrent rebuilds racing the same row;
the harness is wired to report them when the situation arises rather than try
to manufacture contention in a smoke test.

Phase E.3 (RecentlyUsedCache hit/miss) and E.4 (per-EvolveAsync lookup count)
still need JasperFx-side hooks; pinned as separate PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-on to the previous commit that landed the code; the README edits were
attempted in the same batch but failed against an updated file. Splitting out
as a focused README-only commit so the diff stays clear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller merged commit 3815168 into master Jun 8, 2026
8 checks passed
@jeremydmiller jeremydmiller deleted the feature/4684-phase-e2-lockwait branch June 8, 2026 19:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant