feat(workloads): add performance metrics collection for DR drill testing #4
Draft
tvaron3 wants to merge 15 commits into
…hroughput
- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline.
New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth
Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction); only successful operations record latency, to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management
Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
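To make the "sorted-list percentile" and "atomic drain_all()" ideas in this commit message concrete, here is a minimal sketch. It is not the code in perf_stats.py: the class layout, method names other than drain_all(), and the nearest-rank percentile rule are all assumptions made for illustration.

```python
import threading
from bisect import insort


def percentile(sorted_vals, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int 0..100)."""
    rank = max(1, (len(sorted_vals) * pct + 99) // 100)  # integer ceil(n * pct / 100)
    return sorted_vals[rank - 1]


def summarize(sorted_vals):
    """Build the latency fields of one summary from sorted millisecond samples."""
    return {
        "count": len(sorted_vals),
        "min_ms": sorted_vals[0],
        "max_ms": sorted_vals[-1],
        "mean_ms": sum(sorted_vals) / len(sorted_vals),
        "p50_ms": percentile(sorted_vals, 50),
        "p90_ms": percentile(sorted_vals, 90),
        "p99_ms": percentile(sorted_vals, 99),
    }


class Stats:
    """Hypothetical stand-in: latencies and error counts guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latencies = {}  # operation -> sorted list of latencies in ms
        self._errors = {}     # (operation, status_code, sub_status) -> count

    def record_latency(self, operation, latency_ms):
        with self._lock:
            insort(self._latencies.setdefault(operation, []), latency_ms)

    def record_error(self, operation, status_code, sub_status=None):
        key = (operation, status_code, sub_status)
        with self._lock:
            self._errors[key] = self._errors.get(key, 0) + 1

    def drain_all(self):
        # Swap both maps out under a single lock acquisition so the summaries
        # and the error counts describe exactly the same interval.
        with self._lock:
            latencies, self._latencies = self._latencies, {}
            errors, self._errors = self._errors, {}
        summaries = {op: summarize(vals) for op, vals in latencies.items() if vals}
        return summaries, errors
```

The sorted lists grow with every sample until the next drain, which is the unbounded-memory trade-off a later commit in this PR addresses by switching to hdrhistogram.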
psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Fixed memory usage (~40KB per histogram vs unbounded list growth)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
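As an illustration of the environment-variable handling this commit describes, the sketch below parses and validates WORKLOAD_OPERATIONS and reads a boolean flag. The helper names, the accepted truthy strings, and the error wording are assumptions, not taken from workload.py.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")  # the three ops named in the commit


def parse_operations(raw):
    """Split a comma-separated operation list and reject unknown names."""
    ops = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [op for op in ops if op not in VALID_OPERATIONS]
    if unknown:
        raise ValueError(
            f"Unknown WORKLOAD_OPERATIONS value(s) {unknown}; "
            f"expected a subset of {list(VALID_OPERATIONS)}"
        )
    if not ops:
        raise ValueError("WORKLOAD_OPERATIONS must name at least one operation")
    return ops


def env_flag(name, environ=os.environ):
    """Read a boolean env var; the set of truthy strings is an assumption."""
    return environ.get(name, "false").strip().lower() in ("1", "true", "yes")
```

Calling `parse_operations(os.environ.get("WORKLOAD_OPERATIONS", "read,write,query"))` at module level would give the import-time failure the commit mentions; the default shown in that call is also an assumption.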
…rkload.py Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars. Kept: r_w_q_with_incorrect_client_workload.py (special test case) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
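The difference WORKLOAD_SKIP_CLOSE toggles can be sketched without the SDK. `FakeClient` below is a stand-in that only tracks whether close() ran, and `run_workload` is a hypothetical helper; neither is the PR's code.

```python
class FakeClient:
    """Stand-in for a Cosmos client; records whether close() was called."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions

    def close(self):
        self.closed = True


def run_workload(client_factory, do_work, skip_close):
    """Run do_work with a client, optionally leaving the client unclosed."""
    if skip_close:
        # Simulates an application that never closes its client.
        client = client_factory()
        do_work(client)
        return client
    # Normal path: the context manager closes the client on exit.
    with client_factory() as client:
        do_work(client)
    return client
```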
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
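A minimal sketch of the timing pattern this commit switches to, using only the standard library. The `timed_call` wrapper name is hypothetical. Because the elapsed time is computed only after the call returns, a call that raises yields no latency sample, in line with the earlier note that only successful operations record latency.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Return (result, elapsed_ms) for a successful call to fn."""
    start_ns = time.perf_counter_ns()  # integer nanoseconds, monotonic
    result = fn(*args, **kwargs)       # an exception here propagates untimed
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    return result, elapsed_ms
```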
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
06a9932 to a138997
tvaron3 added a commit that referenced this pull request May 30, 2026
…ation tests
- Mirror async drain-loop fix in sync routing_map_provider so /pkranges change-feed paginates correctly when the service returns multiple pages per refresh (sync path was previously susceptible to the same incomplete routing map seen in async).
- Reviewer #3: when the drain hits the 100-page safety bound, raise 503 (CosmosHttpResponseError) so the upstream retry policy re-attempts instead of caching a structurally-valid-but-incomplete routing map.
- Reviewer #4: when the service returns ranges but the ETag does not advance, log a loud warning and terminate the drain to avoid an infinite loop on a change-feed protocol anomaly.
- Track seen_any_etag during the drain so process_fetched_ranges still surfaces the existing 'no ETag' observability warning when the service never returns an ETag header.
- Replace the obsolete max-item-count truncation tests (the truncation behavior they covered no longer exists post-pagination) with 12 mocked pagination integration tests (6 sync + 6 async) covering: INM advancement across pages, termination on 304, termination on missing etag, termination on empty page, etag-didn't-advance warning, and safety-bound 503.
- Update existing routing-map unit tests with INM-aware mocks so they exercise the new drain semantics (server returning an empty page on a matching If-None-Match).
- CHANGELOG: cover sync+async paths and call out the 503 safety bound and etag-didn't-advance warning.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
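The drain semantics listed in this commit message can be sketched as a bounded loop. This is an illustration only: `fetch_page` is a hypothetical callable returning `(status, ranges, etag)`, `IncompleteRoutingMapError` stands in for the CosmosHttpResponseError with status 503, and the order in which the termination conditions are checked is an assumption rather than the SDK's actual order.

```python
import logging

logger = logging.getLogger("routing_map_sketch")
MAX_PAGES = 100  # safety bound named in the commit message


class IncompleteRoutingMapError(Exception):
    """Stand-in for a CosmosHttpResponseError carrying a 503 status."""
    status_code = 503


def drain_pk_ranges(fetch_page, etag=None, max_pages=MAX_PAGES):
    """Follow If-None-Match across pages; return (ranges, etag, seen_any_etag)."""
    ranges = []
    seen_any_etag = False
    for _ in range(max_pages):
        status, page, new_etag = fetch_page(etag)  # etag is sent as If-None-Match
        if status == 304:
            return ranges, etag, seen_any_etag      # nothing newer on the server
        ranges.extend(page)
        if new_etag is None:
            return ranges, etag, seen_any_etag      # no ETag: cannot continue
        seen_any_etag = True
        if not page:
            return ranges, new_etag, seen_any_etag  # empty page: drained
        if new_etag == etag:
            logger.warning("ETag did not advance (%s); stopping drain", etag)
            return ranges, etag, seen_any_etag
        etag = new_etag
    # Bound reached: fail loudly so a retry policy can re-attempt instead of
    # caching a routing map that may be incomplete.
    raise IncompleteRoutingMapError(f"drain exceeded {max_pages} pages")
```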
Summary
Add a performance metrics library to the Python Cosmos DB workloads that reports PerfResult documents to a results Cosmos DB account, matching the Rust perf tool schema exactly. Both SDKs write to the same ADX → Grafana pipeline.

New Files
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 minutes and upserts PerfResult documents via sync CosmosClient with AAD auth (DefaultAzureCredential)

Modified Files
- workload_configs.py: All configs now driven by environment variables with sensible defaults
- workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction). Only successful operations record latency.
- *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management

PerfResult Document Schema
Matches Rust exactly:
- id, partition_key, workload_id, commit_sha, hostname, TIMESTAMP
- operation, count, errors, min_ms, max_ms, mean_ms, p50_ms, p90_ms, p99_ms
- cpu_percent, memory_bytes, system_cpu_percent, system_total_memory_bytes, system_used_memory_bytes
- sdk_language="python", sdk_version from azure.cosmos.__version__
- config_concurrency, config_application_region, config_excluded_regions, config_ppcb_enabled

Key Design Decisions
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
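For orientation, the PerfResult fields listed in the schema section of this description could be assembled roughly as follows. This is a sketch under assumptions: the `build_perf_result` helper, the choice of `workload_id` as the partition key value, the random UUID `id`, and the ISO-8601 `TIMESTAMP` format are all illustrative and not confirmed by the PR. CPU and memory values are passed in rather than read from psutil so the example stays standard-library only.

```python
import socket
import uuid
from datetime import datetime, timezone


def build_perf_result(workload_id, commit_sha, operation, summary,
                      error_count, system_metrics, config, sdk_version):
    """Assemble one PerfResult-shaped dict from already-collected values."""
    doc = {
        "id": str(uuid.uuid4()),                             # assumption: random id
        "partition_key": workload_id,                        # assumption
        "workload_id": workload_id,
        "commit_sha": commit_sha,
        "hostname": socket.gethostname(),
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),  # assumed format
        "operation": operation,
        "errors": error_count,
        "sdk_language": "python",
        "sdk_version": sdk_version,
    }
    doc.update(summary)         # count, min_ms, max_ms, mean_ms, p50_ms, ...
    doc.update(system_metrics)  # cpu_percent, memory_bytes, system_* fields
    # Config snapshot keys are prefixed, e.g. concurrency -> config_concurrency.
    doc.update({f"config_{key}": value for key, value in config.items()})
    return doc
```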