feat(workloads): add performance metrics collection for DR drill testing by tvaron3 · Pull Request #4 · tvaron3/azure-sdk-for-python

tvaron3 · 2026-04-10T03:08:18Z

Summary

Add a performance metrics library to the Python Cosmos DB workloads that reports PerfResult documents to a results Cosmos DB account, matching the Rust perf tool schema exactly. Both SDKs write to the same ADX → Grafana pipeline.

New Files

perf_stats.py — Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
perf_config.py — All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
perf_reporter.py — Background daemon thread that drains Stats every 5 minutes and upserts PerfResult documents via sync CosmosClient with AAD auth (DefaultAzureCredential)
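As an illustration of the perf_stats.py description above, here is a minimal sketch of a lock-protected recorder whose drain_all() swaps out latencies and error counts in one critical section, so the summary and the error counts describe the same window. The class layout, method names other than drain_all(), and the nearest-rank percentile rule are assumptions, not the PR's actual code.

```python
import threading
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted, non-empty list (integer pct)."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_values[rank - 1]


def summarize(latencies, errors):
    """Build one per-operation summary from raw latencies and an error count."""
    values = sorted(latencies)
    summary = {"count": len(values), "errors": errors}
    if values:
        summary.update(
            min_ms=values[0],
            max_ms=values[-1],
            mean_ms=sum(values) / len(values),
            p50_ms=percentile(values, 50),
            p90_ms=percentile(values, 90),
            p99_ms=percentile(values, 99),
        )
    return summary


class Stats:
    """Hypothetical stand-in for the Stats class in perf_stats.py."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latencies = defaultdict(list)  # operation -> latencies in ms
        self._errors = defaultdict(int)      # operation -> error count

    def record_latency(self, operation, latency_ms):
        with self._lock:
            self._latencies[operation].append(latency_ms)

    def record_error(self, operation):
        with self._lock:
            self._errors[operation] += 1

    def drain_all(self):
        """Swap out both maps under one lock so summaries and errors stay consistent."""
        with self._lock:
            latencies, self._latencies = self._latencies, defaultdict(list)
            errors, self._errors = self._errors, defaultdict(int)
        # Sorting happens outside the lock so recording threads are not blocked.
        operations = set(latencies) | set(errors)
        return {
            op: summarize(latencies.get(op, []), errors.get(op, 0))
            for op in operations
        }
```

Sorting outside the lock keeps the critical section to two dictionary swaps, which matters when many workload threads record concurrently.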

Modified Files

workload_configs.py — All configs now driven by environment variables with sensible defaults
workload_utils.py — Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction). Only successful operations record latency.
All *_workload.py files — Integrated Stats + PerfReporter with try/finally lifecycle management
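The timed-wrapper behavior described for workload_utils.py could look roughly like the sketch below: failures are counted by status code and sub-status, and only successful calls contribute a latency sample. FakeCosmosError and Recorder are hypothetical stand-ins (the real code catches CosmosHttpResponseError and writes to Stats), and re-raising after recording is an assumption.

```python
import time


class FakeCosmosError(Exception):
    """Hypothetical stand-in for azure.cosmos.exceptions.CosmosHttpResponseError."""

    def __init__(self, status_code, sub_status=None):
        super().__init__(f"status={status_code} sub_status={sub_status}")
        self.status_code = status_code
        self.sub_status = sub_status


class Recorder:
    """Hypothetical, single-threaded stand-in for the Stats object."""

    def __init__(self):
        self.latencies = {}  # operation -> list of latencies in ms
        self.errors = {}     # (operation, status_code, sub_status) -> count

    def record_latency(self, operation, latency_ms):
        self.latencies.setdefault(operation, []).append(latency_ms)

    def record_error(self, operation, status_code, sub_status):
        key = (operation, status_code, sub_status)
        self.errors[key] = self.errors.get(key, 0) + 1


def timed(recorder, operation, func, *args, **kwargs):
    """Run func, recording latency on success and an error count on failure."""
    start_ns = time.perf_counter_ns()
    try:
        result = func(*args, **kwargs)
    except FakeCosmosError as exc:
        # Failed calls are counted but deliberately contribute no latency sample.
        recorder.record_error(operation, exc.status_code, exc.sub_status)
        raise  # assumption: the caller decides how to handle the failure
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    recorder.record_latency(operation, elapsed_ms)
    return result
```

Keeping error latencies out of the sample set means a burst of fast-failing requests cannot pull the success percentiles down.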

PerfResult Document Schema

Matches Rust exactly:

Identity: id, partition_key, workload_id, commit_sha, hostname, TIMESTAMP
Metrics: operation, count, errors, min_ms, max_ms, mean_ms, p50_ms, p90_ms, p99_ms
System: cpu_percent, memory_bytes, system_cpu_percent, system_total_memory_bytes, system_used_memory_bytes
Cross-SDK: sdk_language="python", sdk_version from azure.cosmos.__version__
Config: config_concurrency, config_application_region, config_excluded_regions, config_ppcb_enabled
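To make the schema concrete, the sketch below assembles one PerfResult document from the field groups listed above. Only the field names come from the PR description; the builder function, the choice of workload_id as the partition key value, the random UUID id, and the ISO 8601 UTC timestamp format are all assumptions.

```python
import socket
import uuid
from datetime import datetime, timezone


def build_perf_result(operation, metrics, system, config, *,
                      workload_id, commit_sha, sdk_version):
    """Assemble one PerfResult document (hypothetical helper, not the PR's code)."""
    doc = {
        # Identity
        "id": str(uuid.uuid4()),                            # assumption: random id
        "partition_key": workload_id,                       # assumption: partition by workload
        "workload_id": workload_id,
        "commit_sha": commit_sha,
        "hostname": socket.gethostname(),
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),  # assumption: ISO 8601 UTC
        # Cross-SDK
        "sdk_language": "python",
        "sdk_version": sdk_version,
        # Metrics key
        "operation": operation,
    }
    doc.update(metrics)  # count, errors, min_ms, max_ms, mean_ms, p50_ms, p90_ms, p99_ms
    doc.update(system)   # cpu_percent, memory_bytes, system_* fields
    doc.update(config)   # config_concurrency, config_application_region, ...
    return doc
```

In the real reporter, sdk_version would come from azure.cosmos.__version__, as the schema section states.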

Key Design Decisions

Sorted-list percentiles (no hdrhistogram native dependency needed)
psutil for CPU/memory with /proc fallback on Linux
Cached psutil.Process() instance for accurate CPU readings
CosmosClient properly stored and closed to avoid resource leaks
PPCB disabled by default
All reporter errors caught and logged as warnings — never crash the workload
Error latencies excluded from success percentiles to avoid metric pollution
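The "never crash the workload" decision can be sketched as a daemon thread that wraps every report cycle in a broad try/except and logs failures as warnings. The drain and upsert callables below are hypothetical stand-ins for Stats.drain_all() and the Cosmos upsert call, and the final flush in stop() is an assumption.

```python
import logging
import threading

logger = logging.getLogger("perf_reporter_sketch")


class PerfReporter:
    """Hypothetical sketch of a background reporter, not the PR's implementation."""

    def __init__(self, drain, upsert, interval_s=300.0):
        self._drain = drain            # callable returning an iterable of documents
        self._upsert = upsert          # callable that persists one document
        self._interval_s = interval_s  # 300 s mirrors PERF_REPORT_INTERVAL
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="perf-reporter", daemon=True
        )

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=5)
        self._report_once()  # assumption: flush whatever is left on shutdown

    def _run(self):
        # Event.wait returns False on timeout (report) and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._report_once()

    def _report_once(self):
        try:
            for doc in self._drain():
                self._upsert(doc)
        except Exception as exc:  # broad on purpose: reporting must not kill the workload
            logger.warning("perf report failed: %s", exc)
```

A workload would typically call start() before its main loop and stop() in a finally block, matching the try/finally lifecycle mentioned under Modified Files.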

…hroughput
- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline.
New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth
Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction), only successful operations record latency to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management
Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Fixed memory usage (~40KB per histogram vs unbounded list growth)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
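The import-time validation of WORKLOAD_OPERATIONS mentioned in this commit might be sketched as follows. The allowed names come from the commit message; the parsing rules (comma-separated, whitespace and case tolerated), the default of all three operations, and the error wording are assumptions.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")


def parse_operations(raw):
    """Parse a comma-separated operation list, rejecting unknown names."""
    ops = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [op for op in ops if op not in VALID_OPERATIONS]
    if not ops or unknown:
        raise ValueError(
            f"WORKLOAD_OPERATIONS={raw!r} is invalid; "
            f"unknown: {unknown}, allowed: {', '.join(VALID_OPERATIONS)}"
        )
    return ops


# Evaluated at import time so a typo fails fast, before any client is created.
# Assumption: all three operations run when the variable is unset.
OPERATIONS = parse_operations(os.environ.get("WORKLOAD_OPERATIONS", "read,write,query"))
```

Failing at import keeps a misconfigured drill from running for hours with the wrong operation mix.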

…rkload.py
Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py
All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars.
Kept: r_w_q_with_incorrect_client_workload.py (special test case)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
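A minimal sketch of the WORKLOAD_SKIP_CLOSE behavior, assuming the flag simply selects between using the client as a context manager and creating it without ever closing it. FakeClient and run_workload are hypothetical names used only for illustration.

```python
class FakeClient:
    """Hypothetical stand-in for a Cosmos client that supports `with`."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        self.closed = True


def run_workload(client_factory, work, skip_close):
    """Run work(client), closing the client unless skip_close is set."""
    if skip_close:
        # Simulates an application that never closes its client.
        client = client_factory()
        work(client)
        return client
    with client_factory() as client:
        work(client)
    return client
```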

Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
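One way the flag could be read into the config snapshot is sketched below. The environment variable name and the config_multi_write_enabled field name come from the PR; the set of accepted truthy strings and the helper names are assumptions.

```python
import os

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumption: accepted truthy spellings


def env_flag(name, default=False, environ=os.environ):
    """Read a boolean env var without raising on missing or malformed values."""
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in _TRUE_VALUES


def config_snapshot(environ=os.environ):
    """Return the multi-write portion of the PerfResult config fields."""
    return {
        "config_multi_write_enabled": env_flag(
            "COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS", environ=environ
        )
    }
```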

The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ation tests
- Mirror async drain-loop fix in sync routing_map_provider so /pkranges change-feed paginates correctly when the service returns multiple pages per refresh (sync path was previously susceptible to the same incomplete routing map seen in async).
- Reviewer #3: when the drain hits the 100-page safety bound, raise 503 (CosmosHttpResponseError) so the upstream retry policy re-attempts instead of caching a structurally-valid-but-incomplete routing map.
- Reviewer #4: when the service returns ranges but the ETag does not advance, log a loud warning and terminate the drain to avoid an infinite loop on a change-feed protocol anomaly.
- Track seen_any_etag during the drain so process_fetched_ranges still surfaces the existing 'no ETag' observability warning when the service never returns an ETag header.
- Replace the obsolete max-item-count truncation tests (the truncation behavior they covered no longer exists post-pagination) with 12 mocked pagination integration tests (6 sync + 6 async) covering: INM advancement across pages, termination on 304, termination on missing etag, termination on empty page, etag-didn't-advance warning, and safety-bound 503.
- Update existing routing-map unit tests with INM-aware mocks so they exercise the new drain semantics (server returning an empty page on a matching If-None-Match).
- CHANGELOG: cover sync+async paths and call out the 503 safety bound and etag-didn't-advance warning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
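The drain semantics listed in this commit can be illustrated with a simplified loop. This is a sketch under stated assumptions, not the SDK's routing_map_provider code: fetch_page is a hypothetical callable that takes the current If-None-Match value and returns (status, ranges, etag), ServiceUnavailable stands in for a 503 CosmosHttpResponseError, and keeping the ranges from a page that lacks an ETag is an assumption.

```python
import logging

logger = logging.getLogger("routing_map_sketch")
MAX_PAGES = 100  # safety bound named in the commit message


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a CosmosHttpResponseError with status 503."""

    status_code = 503


def drain_pk_ranges(fetch_page, etag=None, max_pages=MAX_PAGES):
    """Page through a change feed until it reports no further changes."""
    ranges = []
    seen_any_etag = False
    for _ in range(max_pages):
        status, page, new_etag = fetch_page(etag)
        if status == 304 or not page:
            # Not modified, or an empty page: the feed is fully drained.
            return ranges, etag, seen_any_etag
        ranges.extend(page)
        if new_etag is None:
            # Without an ETag there is nothing to continue from.
            return ranges, etag, seen_any_etag
        seen_any_etag = True
        if new_etag == etag:
            logger.warning("ranges returned but ETag %r did not advance", etag)
            return ranges, etag, seen_any_etag
        etag = new_etag
    # Hitting the bound means the result may be incomplete, so let a retry happen.
    raise ServiceUnavailable(f"drain exceeded {max_pages} pages")
```

Raising at the bound instead of returning is what keeps a structurally valid but incomplete result from being cached.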

tvaron3 and others added 2 commits April 9, 2026 19:29

github-actions Bot added Cosmos App Configuration Speech Transcription Cognitive - Content Understanding Evaluation Load Test Service Machine Learning labels Apr 10, 2026

tvaron3 and others added 13 commits April 9, 2026 23:26

perf(workloads): use perf_counter_ns for higher precision timing

14d7797

Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): use get_mean_value() for hdrh API

63ae1a8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

style(workloads): fix black formatting and setup_env.sh references

52a3956

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat(workloads): add config_multi_write_enabled to PerfResult

12b4a20

Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): define multi_write variable in perf_reporter

f647d54

The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix(workloads): address review findings — lazy imports, session close…

b595bd9

…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

tvaron3 force-pushed the feat/dr-drill-workload-fixes branch from 06a9932 to a138997 Compare April 12, 2026 01:45

tvaron3 force-pushed the main branch from 3dffaf2 to 90af8cf Compare May 26, 2026 21:01

tvaron3 wants to merge 15 commits into main from feat/dr-drill-workload-fixes
