
feat(workloads): add performance metrics collection for DR drill testing#4

Draft
tvaron3 wants to merge 15 commits into main from feat/dr-drill-workload-fixes

tvaron3 (Owner) commented Apr 10, 2026

Summary

Add a performance metrics library to the Python Cosmos DB workloads that reports PerfResult documents to a results Cosmos DB account, matching the Rust perf tool schema exactly. Both SDKs write to the same ADX → Grafana pipeline.

New Files

  • perf_stats.py — Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
  • perf_config.py — All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
  • perf_reporter.py — Background daemon thread that drains Stats every 5 minutes and upserts PerfResult documents via sync CosmosClient with AAD auth (DefaultAzureCredential)

Modified Files

  • workload_configs.py — All configs now driven by environment variables with sensible defaults
  • workload_utils.py — Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction). Only successful operations record latency.
  • All *_workload.py files — Integrated Stats + PerfReporter with try/finally lifecycle management
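
A timed wrapper in the spirit of the workload_utils.py change might look like the sketch below. The function name `timed_call`, the dictionaries used for bookkeeping, and the `FakeCosmosError` class are all hypothetical; the fake exception stands in for `CosmosHttpResponseError` so the example runs without the SDK installed. The one behavior taken from the description is that latency is recorded only when the operation succeeds, while failures are counted by status code and sub-status.

```python
import time


class FakeCosmosError(Exception):
    """Hypothetical stand-in for CosmosHttpResponseError (illustration only)."""

    def __init__(self, status_code, sub_status=None):
        super().__init__(f"status={status_code} sub_status={sub_status}")
        self.status_code = status_code
        self.sub_status = sub_status


def timed_call(operation, func, latencies, errors):
    """Run func(); record latency on success, count the error on failure."""
    start_ns = time.perf_counter_ns()
    try:
        result = func()
    except Exception as exc:
        # Read the status fields defensively: not every exception has them.
        key = (
            operation,
            getattr(exc, "status_code", None),
            getattr(exc, "sub_status", None),
        )
        errors[key] = errors.get(key, 0) + 1
        raise
    # Reached only on success, so failed calls never skew the percentiles.
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    latencies.setdefault(operation, []).append(elapsed_ms)
    return result


def failing_read():
    raise FakeCosmosError(503, 1002)


latencies, errors = {}, {}
value = timed_call("read", lambda: 42, latencies, errors)
try:
    timed_call("read", failing_read, latencies, errors)
except FakeCosmosError:
    pass
```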

PerfResult Document Schema

Matches Rust exactly:

  • Identity: id, partition_key, workload_id, commit_sha, hostname, TIMESTAMP
  • Metrics: operation, count, errors, min_ms, max_ms, mean_ms, p50_ms, p90_ms, p99_ms
  • System: cpu_percent, memory_bytes, system_cpu_percent, system_total_memory_bytes, system_used_memory_bytes
  • Cross-SDK: sdk_language="python", sdk_version from azure.cosmos.__version__
  • Config: config_concurrency, config_application_region, config_excluded_regions, config_ppcb_enabled
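
To make the schema concrete, here is a sketch of how one PerfResult document could be assembled from a per-operation summary. The field names are the ones listed above; everything else is an assumption for illustration: the helper name `build_perf_result`, the use of a random UUID for `id`, reusing `workload_id` as `partition_key`, and the ISO 8601 timestamp format. The System fields are omitted here because they depend on psutil.

```python
import socket
import uuid
from datetime import datetime, timezone


def build_perf_result(operation, summary, workload_id, commit_sha,
                      sdk_version, config):
    """Assemble one PerfResult-shaped dict (illustrative, not the PR's code)."""
    doc = {
        # Identity
        "id": str(uuid.uuid4()),            # assumption: random id per document
        "partition_key": workload_id,       # assumption: partition by workload
        "workload_id": workload_id,
        "commit_sha": commit_sha,
        "hostname": socket.gethostname(),
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),
        # Cross-SDK
        "sdk_language": "python",
        "sdk_version": sdk_version,
        # Metrics
        "operation": operation,
    }
    doc.update(summary)
    # Config snapshot, flattened with the config_ prefix used by the schema.
    doc.update({f"config_{key}": value for key, value in config.items()})
    return doc


doc = build_perf_result(
    operation="read",
    summary={"count": 100, "errors": 1, "min_ms": 1.0, "max_ms": 100.0,
             "mean_ms": 50.5, "p50_ms": 50.0, "p90_ms": 90.0, "p99_ms": 99.0},
    workload_id="example-workload",         # hypothetical values throughout
    commit_sha="0000000",
    sdk_version="0.0.0",
    config={"concurrency": 8, "application_region": "",
            "excluded_regions": "", "ppcb_enabled": False},
)
```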

Key Design Decisions

  • Sorted-list percentiles (no hdrhistogram native dependency needed)
  • psutil for CPU/memory with /proc fallback on Linux
  • Cached psutil.Process() instance for accurate CPU readings
  • CosmosClient properly stored and closed to avoid resource leaks
  • PPCB disabled by default
  • All reporter errors caught and logged as warnings — never crash the workload
  • Error latencies excluded from success percentiles to avoid metric pollution
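
The "never crash the workload" rule for the reporter can be illustrated with a small daemon-thread sketch. The constructor parameters `drain` and `upsert` are hypothetical callables standing in for `Stats.drain_all()` and the Cosmos upsert call, and the final flush in `stop()` is an assumption rather than something the description states.

```python
import logging
import threading

logger = logging.getLogger("perf_reporter")


class PerfReporter:
    """Periodically drains stats and upserts documents (illustrative sketch)."""

    def __init__(self, drain, upsert, interval_s=300.0):
        self._drain = drain        # hypothetical: returns a list of documents
        self._upsert = upsert      # hypothetical: writes one document
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="perf-reporter", daemon=True
        )

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=5)
        self._report_once()        # assumption: flush what is left on shutdown

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._report_once()

    def _report_once(self):
        try:
            for doc in self._drain():
                self._upsert(doc)
        except Exception as exc:
            # Reporting is best effort: log and keep the workload running.
            logger.warning("perf report failed: %s", exc)


def broken_upsert(doc):
    raise RuntimeError("results account unreachable")


sent = []
reporter = PerfReporter(lambda: [{"id": "1"}], sent.append)
reporter.start()
reporter.stop()

# A failing upsert is logged as a warning and does not propagate.
PerfReporter(lambda: [{"id": "2"}], broken_upsert)._report_once()
```

A workload would typically call `start()` before its loop and `stop()` in a `finally` block, matching the try/finally lifecycle mentioned under Modified Files.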

tvaron3 and others added 2 commits April 9, 2026 19:29
…hroughput

- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a
Cosmos DB results account, matching the Rust perf tool schema exactly so
both SDKs feed the same ADX → Grafana pipeline.

New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile
  calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI,
  PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min
  and upserts PerfResult documents via sync CosmosClient with AAD auth

Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking
  (CosmosHttpResponseError status_code/sub_status extraction), only
  successful operations record latency to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally
  lifecycle management

Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 13 commits April 9, 2026 23:26
psutil is now a hard import (not optional). Removed all /proc/meminfo
and /proc/self/status fallback parsing — if psutil is not installed,
the import will fail immediately rather than silently degrading.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Constant memory footprint (~40KB per histogram vs unbounded list growth)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
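
The environment-variable handling in the commit above could be validated along these lines. Only the variable names `WORKLOAD_OPERATIONS` and `WORKLOAD_USE_PROXY` and the three operation names come from the commit message; the helper names, the accepted truthy strings, and the behavior on an empty list are assumptions.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")


def parse_operations(raw):
    """Split a comma-separated operation list and reject unknown names."""
    ops = [op.strip().lower() for op in raw.split(",") if op.strip()]
    unknown = [op for op in ops if op not in VALID_OPERATIONS]
    if unknown:
        raise ValueError(
            f"Unknown WORKLOAD_OPERATIONS entries {unknown}; "
            f"valid values are {list(VALID_OPERATIONS)}"
        )
    if not ops:
        raise ValueError("WORKLOAD_OPERATIONS must name at least one operation")
    return ops


def env_flag(name, default=False):
    """Read a boolean flag from the environment (assumed truthy spellings)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")


os.environ["WORKLOAD_USE_PROXY"] = "true"
operations = parse_operations("read, Write")
use_proxy = env_flag("WORKLOAD_USE_PROXY")
use_sync = env_flag("WORKLOAD_USE_SYNC_NOT_SET_IN_THIS_EXAMPLE")
```

Running the parser at import time, as the commit describes, means a typo in the variable fails the process before any client is created.
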
…rkload.py

Removed: r_workload.py, w_workload.py, r_proxy_workload.py,
w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py,
r_w_q_sync_workload.py

All replaced by workload.py with WORKLOAD_OPERATIONS and
WORKLOAD_USE_PROXY env vars.

Kept: r_w_q_with_incorrect_client_workload.py (special test case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var:
WORKLOAD_SKIP_CLOSE=true creates the client without a context manager,
simulating applications that don't properly close the Cosmos client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
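
The difference that `WORKLOAD_SKIP_CLOSE` toggles can be shown with a stand-in client. `FakeClient` and `run_workload` are hypothetical names; the real workload would use the Cosmos client instead, and this sketch only demonstrates the two lifecycle paths.

```python
class FakeClient:
    """Hypothetical stand-in for a Cosmos client; tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions from the workload body

    def close(self):
        self.closed = True


def run_workload(skip_close, work):
    """Run work(client) with or without context-manager cleanup."""
    if skip_close:
        # Simulates an application that never closes its client.
        client = FakeClient()
        work(client)
        return client
    with FakeClient() as client:
        work(client)
    return client


closed_client = run_workload(skip_close=False, work=lambda client: None)
leaked_client = run_workload(skip_close=True, work=lambda client: None)
```
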
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000
for nanosecond precision without floating-point multiplication artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
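
A small sketch of the timing change described in the commit above: both readings are integers in nanoseconds, and the single division converts the difference to milliseconds. The helper name `elapsed_ms` is an assumption.

```python
import time


def elapsed_ms(start_ns, end_ns):
    """Convert a perf_counter_ns() interval to milliseconds."""
    # Integer subtraction first, then one division to get milliseconds.
    return (end_ns - start_ns) / 1_000_000


start_ns = time.perf_counter_ns()
sum(range(1000))  # placeholder for the operation being timed
duration_ms = elapsed_ms(start_ns, time.perf_counter_ns())
```
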
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo,
not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py)
stays here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram

The pip package is 'hdrhistogram' but the Python module is 'hdrh'.
Import changed from 'from hdrhistogram import HdrHistogram' to
'from hdrh.histogram import HdrHistogram'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot
so it's visible in the Grafana dashboard and queryable from Kusto.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary

Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of
root .vscode/cspell.json to keep changes within cosmos folder scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 force-pushed the feat/dr-drill-workload-fixes branch from 06a9932 to a138997 on April 12, 2026 01:45
tvaron3 added a commit that referenced this pull request May 30, 2026
…ation tests

- Mirror async drain-loop fix in sync routing_map_provider so /pkranges
  change-feed paginates correctly when the service returns multiple pages
  per refresh (sync path was previously susceptible to the same incomplete
  routing map seen in async).
- Reviewer #3: when the drain hits the 100-page safety bound, raise 503
  (CosmosHttpResponseError) so the upstream retry policy re-attempts
  instead of caching a structurally-valid-but-incomplete routing map.
- Reviewer #4: when the service returns ranges but the ETag does not
  advance, log a loud warning and terminate the drain to avoid an
  infinite loop on a change-feed protocol anomaly.
- Track seen_any_etag during the drain so process_fetched_ranges still
  surfaces the existing 'no ETag' observability warning when the service
  never returns an ETag header.
- Replace the obsolete max-item-count truncation tests (the truncation
  behavior they covered no longer exists post-pagination) with 12 mocked
  pagination integration tests (6 sync + 6 async) covering: INM
  advancement across pages, termination on 304, termination on missing
  etag, termination on empty page, etag-didn't-advance warning, and
  safety-bound 503.
- Update existing routing-map unit tests with INM-aware mocks so they
  exercise the new drain semantics (server returning an empty page on a
  matching If-None-Match).
- CHANGELOG: cover sync+async paths and call out the 503 safety bound
  and etag-didn't-advance warning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
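
The drain semantics listed in the referenced commit can be sketched as a loop over pages. This is not the SDK's routing_map_provider code: `fetch_page`, `drain_pk_ranges`, the `(status, ranges, etag)` tuple, and `ServiceUnavailableError` are hypothetical, and the order of the checks is an assumption. The termination conditions (304, empty page, missing ETag, ETag that does not advance) and the 100-page bound that raises a 503-style error are taken from the commit message.

```python
import logging

logger = logging.getLogger("routing_map_sketch")

MAX_PAGES = 100  # safety bound named in the commit message


class ServiceUnavailableError(Exception):
    """Hypothetical stand-in for a 503 CosmosHttpResponseError."""

    status_code = 503


def drain_pk_ranges(fetch_page, etag=None, max_pages=MAX_PAGES):
    """Collect ranges page by page; fetch_page(etag) -> (status, ranges, etag)."""
    ranges = []
    seen_any_etag = False
    for _ in range(max_pages):
        status, page, new_etag = fetch_page(etag)
        if status == 304 or not page:
            return ranges, etag, seen_any_etag   # nothing new: drain complete
        ranges.extend(page)
        if new_etag is None:
            return ranges, etag, seen_any_etag   # cannot continue without ETag
        seen_any_etag = True
        if new_etag == etag:
            logger.warning("ranges returned but ETag did not advance: %s", etag)
            return ranges, etag, seen_any_etag   # avoid looping forever
        etag = new_etag
    # Bound hit: fail so a retry policy can re-attempt, rather than
    # returning a map that may be incomplete.
    raise ServiceUnavailableError(f"drain exceeded {max_pages} pages")


pages = iter([(200, ["a"], "e1"), (200, ["b"], "e2"), (304, [], "e2")])
result = drain_pk_ranges(lambda etag: next(pages))
```
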