feat(api): Add Prometheus metrics and Grafana dashboard for staging monitoring by FSM1 · Pull Request #141 · FSM1/cipher-box

FSM1 · 2026-02-18T01:27:25Z

Summary

This PR adds comprehensive Prometheus metrics collection and a pre-built Grafana dashboard to monitor the CipherBox staging environment. Metrics are scraped from the API's /metrics endpoint and shipped to Grafana Cloud Mimir, complementing the existing Loki log aggregation.

Key Changes

Metrics Service (apps/api/src/metrics/)
- MetricsService: Central registry managing Prometheus metrics (gauges, counters, histograms)
- Gauges polled from database every 30 seconds: users, files, storage, IPNS entries, republish schedule status
- Event-driven counters: file operations, IPNS publishes/resolves, authentication, republish results
- HTTP request duration histogram with method/route/status labels
- MetricsController: Exposes /metrics endpoint (excluded from Swagger, unauthenticated)
- HttpMetricsInterceptor: Captures HTTP request metrics globally across all routes
Integration
- MetricsModule registered as global module in AppModule
- HttpMetricsInterceptor registered as global interceptor
- Metrics incremented in controllers: AuthController (logins), IpnsController (publishes/resolves), IpfsController (uploads/downloads), RepublishProcessor (republish results)
Monitoring Infrastructure
- Updated alloy-config.river: Added Prometheus scrape job for API /metrics endpoint (30s interval) with remote write to Grafana Cloud Mimir
- cipherbox-staging.json: Complete Grafana dashboard with 20+ panels covering:
  - Overview stats (users, files, storage, IPNS entries)
  - File operations (uploads, downloads, storage growth)
  - IPNS operations (publishes, resolves, entry growth)
  - TEE republishing (schedule status, results, retry tracking)
  - Authentication (login methods, new vs returning users)
  - HTTP performance (request rates, response times p50/p95/p99, error rates)
  - Log panels (API errors, TEE republish logs)
Configuration
- Added GitHub secrets: GRAFANA_PROMETHEUS_URL, GRAFANA_PROMETHEUS_USERNAME, GRAFANA_PROMETHEUS_API_KEY
- Updated deploy workflow and docker-compose to pass Prometheus credentials to Alloy
- Updated MONITORING.md with Prometheus setup instructions and metrics reference
Dependencies
- Added prom-client@^15.1.3 for Prometheus client library

Implementation Details

Metrics use label cardinality controls (e.g., route normalization in HTTP interceptor) to stay within Grafana Cloud Free tier limits (10,000 series)
Gauges are collected asynchronously every 30 seconds to avoid blocking request handling
HTTP metrics exclude /metrics endpoint itself to prevent self-referential loops
Dashboard uses 6-hour default time range with 30-second refresh for real-time visibility

https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w

Summary by CodeRabbit

New Features
- Comprehensive Prometheus metrics (file ops, auth, IPNS, republish, HTTP request durations) and a /metrics endpoint
- New Grafana "CipherBox Staging" dashboard for observability
Documentation
- Expanded monitoring guide with PromQL examples, dashboards, and troubleshooting
Chores
- Added Prometheus client and staging Grafana integration; staging configs and workflows updated to use repo-level variables; staging proxy now blocks direct /metrics access
Tests
- Tests updated to inject/mock metrics instrumentation

Add prom-client to the API with a MetricsModule that exposes /metrics for Grafana Alloy to scrape. Metrics include DB-polled gauges (users, files, storage, IPNS entries, republish schedule status) and event-driven counters (uploads, downloads, IPNS publishes/resolves, republish results, auth logins) plus an HTTP request duration histogram. - New MetricsModule (global) with MetricsService, MetricsController - HttpMetricsInterceptor for request duration tracking - Instrumented IpfsController, IpnsController, RepublishProcessor, AuthController - Updated Alloy config to scrape /metrics and remote-write to Grafana Cloud Mimir - Added Prometheus env vars to docker-compose.staging.yml and deploy workflow - Pre-built Grafana dashboard JSON with 20+ panels across 6 sections - Updated MONITORING.md with full metrics reference and setup instructions https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w

coderabbitai · 2026-02-18T01:27:41Z

Warning

Rate limit exceeded

@FSM1 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 10 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

Walkthrough

Adds Prometheus-based observability: new MetricsService, HttpMetricsInterceptor, /metrics endpoint, instrumentation across controllers/processors and tests, Prometheus scrape/remote_write and Grafana dashboard, plus CI/docker staging env updates and Caddy rule to block /metrics. (≤50 words)

Changes

Cohort / File(s)	Summary
Metrics core `apps/api/src/metrics/metrics.service.ts`, `apps/api/src/metrics/metrics.module.ts`, `apps/api/src/metrics/metrics.controller.ts`, `apps/api/src/metrics/http-metrics.interceptor.ts`, `apps/api/src/metrics/index.ts`	Adds MetricsService (registry, gauges, counters, histogram), global MetricsModule, a GET /metrics controller, HttpMetricsInterceptor that records request durations and normalizes routes, and re-exports.
Instrumentation wiring `apps/api/src/app.module.ts`, `apps/api/scripts/generate-openapi.ts`	Registers MetricsModule and configures HttpMetricsInterceptor as a global APP_INTERCEPTOR; provides a mock MetricsService in the OpenAPI generation script.
Controller & processor instrumentation `apps/api/src/auth/auth.controller.ts`, `apps/api/src/ipfs/ipfs.controller.ts`, `apps/api/src/ipns/ipns.controller.ts`, `apps/api/src/republish/republish.processor.ts`	Injects MetricsService and adds increments/observations for auth logins, file uploads/downloads/unpins, IPNS publishes/resolves, and republish runs/results.
Tests updated for metrics `apps/api/src/auth/auth.controller.spec.ts`, `apps/api/src/ipfs/ipfs.controller.spec.ts`, `apps/api/src/ipns/__tests__/*`, `apps/api/src/republish/republish.processor.spec.ts`	Adds MetricsService mocks/providers and updates test setup to satisfy new constructor/injection requirements.
Dependency `apps/api/package.json`	Adds `prom-client@^15.1.3` dependency for Prometheus instrumentation.
Staging CI & compose vars `.github/workflows/deploy-staging.yml`, `.github/workflows/e2e.yml`, `docker/docker-compose.staging.yml`	Swaps VITE_WEB3AUTH_CLIENT_ID to repository vars in CI workflows and adds GRAFANA_PROMETHEUS_URL / GRAFANA_PROMETHEUS_USERNAME / GRAFANA_PROMETHEUS_API_KEY env entries for staging.
Monitoring infra & docs `docker/alloy-config.river`, `docker/grafana/dashboards/cipherbox-staging.json`, `docker/MONITORING.md`, `docker/Caddyfile`	Adds Prometheus scrape + remote_write to alloy config, new Grafana dashboard JSON and monitoring docs updates, and a Caddy rule that returns 403 for /metrics on the staging domain.
Misc (tests/coverage) `apps/api/jest.config.js`, `apps/api/scripts/generate-openapi.ts`	Excludes metrics from Jest coverage and registers MetricsService as a mocked provider for OpenAPI generation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Caddy
    participant API
    participant MetricsService
    participant Prometheus
    participant Grafana

    Client->>Caddy: HTTP request (/api/...)
    Caddy->>API: proxy request
    API->>API: HttpMetricsInterceptor records start time
    API->>API: Controller executes business logic
    API->>MetricsService: increment counters / observe histogram
    API-->>Client: response
    Prometheus->>API: scrape GET /metrics
    API->>MetricsService: MetricsService.getMetrics()
    API-->>Prometheus: metrics payload (text/plain)
    Grafana->>Prometheus: query metrics (PromQL) / fetch dashboards

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

feat: Phase 9.1 — Environment, DevOps & Staging Deployment #64 — Related staging/monitoring infra changes (deploy workflows, alloy Prometheus config, Caddyfile, monitoring docs) that are extended here with Prometheus metrics and dashboard.
feat(12.6): per-file IPNS metadata split #133 — Modifies IPNS/republish codepaths that overlap with this PR’s instrumentation of IPNS publishes/resolves and republish processor.
Phase 7 - Multi Device Sync #54 — Changes to apps/api/src/ipns/ipns.controller.ts that touch the same controller methods instrumented for metrics in this PR.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: adding Prometheus metrics and a Grafana dashboard to the staging environment for monitoring purposes.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch claude/grafana-staging-dashboards-HuV2C

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Client IDs are not secrets — they're embedded in the built JS bundle. Using vars instead of secrets for this value across deploy-staging and e2e workflows. https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/api/package.json (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Branch coverage threshold failing (76.18% < 80%) must be resolved before merge.

The new MetricsModule, MetricsService, MetricsController, and HttpMetricsInterceptor are added without accompanying tests, which dropped branch coverage below the required 80%. The processor test is also broken (see republish.processor.ts). Unit tests for the metrics subsystem need to be added before this can merge.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/package.json` at line 1, Branch coverage fell below threshold
because new MetricsModule, MetricsService, MetricsController, and
HttpMetricsInterceptor lack unit tests and there's a broken test in
republish.processor.ts; add focused unit tests for each new symbol
(MetricsModule, MetricsService, MetricsController, HttpMetricsInterceptor)
exercising branches (success, error, edge cases and config toggles) and update
or fix the republish.processor.ts test to restore its expected behavior (mock
dependencies, assert side effects, and handle async flows). Ensure you create
mocks for injected dependencies, cover error paths and conditional branches in
methods of MetricsService and HttpMetricsInterceptor, add controller tests for
endpoints in MetricsController, and include the module-level integration tests
for MetricsModule so overall branch coverage meets >=80%.

🧹 Nitpick comments (8)

apps/api/src/metrics/metrics.controller.ts (1)
12-16: Consider @Res({ passthrough: true }) to keep NestJS response lifecycle active.

With the default @Res() (passthrough off), NestJS skips global interceptors and response transformations for this route. While the metrics endpoint doesn't need them, it also means an unhandled exception from getMetrics() may not be cleaned up by the standard exception filter. Using passthrough: true and returning the value is the idiomatic NestJS approach when only header mutation is needed.
♻️ Optional refactor
-async getMetrics(`@Res`() res: Response): Promise<void> {
-  const metrics = await this.metricsService.getMetrics();
-  res.set('Content-Type', this.metricsService.getContentType());
-  res.send(metrics);
-}
+async getMetrics(`@Res`({ passthrough: true }) res: Response): Promise<string> {
+  res.set('Content-Type', this.metricsService.getContentType());
+  return this.metricsService.getMetrics();
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.controller.ts` around lines 12 - 16, The
getMetrics handler currently uses `@Res`() which disables NestJS response
lifecycle; change the decorator to `@Res`({ passthrough: true }) on the getMetrics
method, keep using res.set('Content-Type', this.metricsService.getContentType())
to set headers, and return the metrics value from metricsService.getMetrics()
instead of calling res.send(metrics) so global interceptors and exception
filters remain active; update references in the getMetrics method to use the
returned value flow with the metricsService.getMetrics and
metricsService.getContentType calls.
docker/docker-compose.staging.yml (1)
136-137: Consider adding condition: service_healthy to alloy's api dependency.

The api service has no healthcheck, so depends_on: - api only waits for the container to start, not for the API to be ready. The first few scrapes will fail silently; Prometheus is tolerant of this, but making the dependency explicit with a healthcheck on the API container would give a cleaner startup ordering.
🔧 Optional improvement

Add a healthcheck to the api service:
 api:
   image: ...
+  healthcheck:
+    test: ['CMD-SHELL', 'wget -qO- http://localhost:3000/health || exit 1']
+    interval: 10s
+    timeout: 5s
+    retries: 5
+    start_period: 20s
Then tighten alloy's dependency:
-    depends_on:
-      - api
+    depends_on:
+      api:
+        condition: service_healthy
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/docker-compose.staging.yml` around lines 136 - 137, Add a Docker
healthcheck to the api service (e.g., a lightweight curl or tcp check that
returns 0 when the API is responsive, with sensible interval/retries) and then
update alloy's depends_on entry for api to use condition: service_healthy so
alloy waits for the API to be ready (reference the service names "api" and
"alloy" and edit the docker-compose service definitions for those services).
docker/grafana/dashboards/cipherbox-staging.json (1)
818-834: Consider adding P95 threshold annotations for key routes.

The project has documented performance targets (P95): auth flow <3s, file upload/download <5s, IPNS publish <2s. Adding constant threshold lines to the response time panels would make it easy to spot regressions at a glance. This can be done as a follow-up.

Based on learnings: "Performance benchmark targets (P95): auth flow <3s, file upload/download (<100MB) <5s, IPNS publish <2s, folder create <1s, multi-device sync latency <30s."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/grafana/dashboards/cipherbox-staging.json` around lines 818 - 834,
This panel currently plots histogram_quantile p50/p95/p99 (refId A/B/C); add
explicit P95 threshold lines by appending new targets with unique refIds (e.g.,
D,E,F,...) where each target's expr is a scalar constant (e.g., scalar(3),
scalar(5), scalar(2), scalar(1), scalar(30)) and legendFormat describes the
threshold (auth_p95_threshold, upload_p95_threshold, ipns_publish_p95_threshold,
folder_create_p95_threshold, multi_device_sync_p95_threshold); ensure these new
targets are styled as thin/dashed colored lines (and optionally excluded from
main legend) so they appear as constant threshold annotations on the same
response-time panels for easy visual comparison against the histogram_quantile
curves.
apps/api/src/metrics/metrics.service.ts (5)
18-18: registry is public but the class already exposes getMetrics() and getContentType() as its intended API surface

Exposing the raw Registry instance allows any consumer to register arbitrary metrics into this registry, reset it, or call registry.clear(). Making it private (or protected) enforces encapsulation without breaking any existing usage, since the MetricsController only calls getMetrics().
♻️ Proposed fix
-  readonly registry: client.Registry;
+  private readonly registry: client.Registry;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.service.ts` at line 18, The public Registry
field exposes internal state; change the readonly registry: client.Registry
declaration in the MetricsService class to private (or protected) so external
callers cannot mutate the registry directly, keeping the public API via
getMetrics() and getContentType(); update any internal references inside
MetricsService to use the now-private registry and adjust any tests or consumers
that referenced metricsService.registry to call the public methods instead.
16-16: OnModuleDestroy is missing from the implements clause

onModuleDestroy() exists and is correct, but the class only declares implements OnModuleInit. NestJS discovers lifecycle hooks by duck-typing so the interval will be cleared, but omitting the interface declaration removes the TypeScript compiler check (e.g., typo in method name or wrong return type would go undetected).
♻️ Proposed fix
-import { Injectable, OnModuleInit, Logger } from '@nestjs/common';
+import { Injectable, OnModuleInit, OnModuleDestroy, Logger } from '@nestjs/common';
...
-export class MetricsService implements OnModuleInit {
+export class MetricsService implements OnModuleInit, OnModuleDestroy {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.service.ts` at line 16, The class MetricsService
declares lifecycle method onModuleDestroy but only implements OnModuleInit;
update the class declaration to also implement OnModuleDestroy (i.e., change
"implements OnModuleInit" to "implements OnModuleInit, OnModuleDestroy") and
ensure OnModuleDestroy is imported from NestJS (where OnModuleInit is imported)
so the TypeScript compiler enforces the lifecycle hook signature for
onModuleDestroy.
56-86: Gauge metrics should not carry the _total suffix — that suffix is reserved for counters by Prometheus convention

cipherbox_users_total, cipherbox_files_total, cipherbox_storage_bytes_total, cipherbox_ipns_entries_total, and cipherbox_republish_schedule_total are all client.Gauge instances. The _total suffix signals a monotonically increasing counter to Prometheus tooling and dashboards, which conflicts with the mutable/resettable semantics of these gauge metrics. Consider renaming them to drop _total (e.g., cipherbox_users, cipherbox_files, cipherbox_storage_bytes, cipherbox_ipns_entries, cipherbox_republish_schedule).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.service.ts` around lines 56 - 86, The Gauges
this.usersTotal, this.filesTotal, this.storageBytesTotal, this.ipnsEntriesTotal,
and this.republishScheduleTotal are named with a `_total` suffix which is
reserved for counters; rename the metric name strings in each Gauge definition
to drop `_total` (e.g., change 'cipherbox_users_total' → 'cipherbox_users',
'cipherbox_files_total' → 'cipherbox_files', 'cipherbox_storage_bytes_total' →
'cipherbox_storage_bytes', 'cipherbox_ipns_entries_total' →
'cipherbox_ipns_entries', 'cipherbox_republish_schedule_total' →
'cipherbox_republish_schedule') and update any code that references those metric
names (places that call set(), labels(), inc(), etc. on these Gauge instances)
to use the updated Gauge objects/names so instrumentation and dashboards remain
consistent.
1-219: Unauthenticated /metrics endpoint exposes operational intelligence; verify API client regeneration

Two operational concerns for this PR:

Security posture: The /metrics endpoint is intentionally unauthenticated. It exposes user/file/storage counts, authentication patterns (method, new_user labels), IPNS record-type distributions, and HTTP route latency histograms. For a staging environment reachable from outside the private network this constitutes a meaningful information-disclosure surface. Consider at minimum a static bearer token check in MetricsController, IP-allow-listing at the load balancer, or binding the endpoint to a separate internal-only port.

API client regeneration: The PR introduces a new MetricsController under apps/api. As per the coding guidelines, "After modifying API endpoints, DTOs, or controllers in apps/api, regenerate the API client by running pnpm api:generate to keep the web app type-safe and in sync, and commit the regenerated client files." Even though the endpoint is decorated with @ApiExcludeEndpoint(), the regeneration step should be run and any changed (or unchanged) client files committed to confirm the spec is stable. As per coding guidelines: "After modifying API endpoints, DTOs, or controllers in apps/api, regenerate the API client by running pnpm api:generate."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.service.ts` around lines 1 - 219, Metrics
endpoint is currently unauthenticated and the new controller requires
coordination with the API client; update the MetricsController to restrict
access (e.g., enforce a static bearer token or guard that checks an
internal-only header/IP or apply NestJS AuthGuard) so that calls to the
controller's methods (the route that calls MetricsService.getMetrics /
getContentType) are protected, then run pnpm api:generate and commit the
regenerated API client files to keep the web app types in sync.
157-165: Use prom-client's built-in collect() function for gauge population instead of manual setInterval

The prom-client documentation recommends using collect() as the idiomatic pattern for gauge updates. The collect() function is invoked on each scrape (when /metrics is called), ensuring values are always fresh at scrape time and eliminating the startup timing gap. This removes the need for manual setInterval/onModuleDestroy lifecycle management.

For async operations, pass an async function (not an arrow function—this binding matters) directly in the gauge configuration:
this.usersTotal = new client.Gauge({
  name: 'cipherbox_users_total',
  help: 'Total registered users',
  registers: [this.registry],
  collect: async function () {
    this.set(await this.userRepository.count());
  },
});
Note: Since collect() runs on the scrape path, consider whether querying the database on every scrape is appropriate. If the current 30-second polling interval is important for performance, you may instead use collect() with cached results updated by a background timer.

Apply to all five gauges (lines 56–86) and remove onModuleInit/onModuleDestroy.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/auth/auth.controller.ts`:
- Around line 70-73: The authLogins counter is only incremented on successful
authentication; wrap the call to this.authService.login(...) in a try/catch and
update metric emission to include a result label so both successes and failures
are recorded: on success call this.metricsService.authLogins.inc({ method:
'web3auth', result: 'success', new_user: String(result.isNewUser) }); on error
catch UnauthorizedException (or any thrown error), call
this.metricsService.authLogins.inc({ method: 'web3auth', result: 'failure',
new_user: 'false' }) (or appropriate default), then rethrow the error so
behavior is unchanged; update references to metricsService.authLogins.inc and
authService.login accordingly.

In `@apps/api/src/ipns/ipns.controller.ts`:
- Around line 109-113: The batch publish metric currently counts both successes
and failures; update the increment in the IPNS controller so
metricsService.ipnsPublishes.inc({ type: 'batch' }, ...) uses only
result.totalSucceeded (not totalSucceeded + totalFailed) to match single publish
behavior (see ipnsService.publishBatch and the single publish path that only
increments on success), or alternatively add a label like {type: 'batch',
outcome: 'succeeded'|'failed'} and call inc separately for succeeded and failed
counts to make success/failure explicit.

In `@apps/api/src/metrics/http-metrics.interceptor.ts`:
- Around line 21-26: The error branch in the interceptor records statusCode from
the response too early; change the error handler in http-metrics.interceptor.ts
to derive status by inspecting the thrown error (check instanceof HttpException
or presence of getStatus()) and call error.getStatus() when available, falling
back to 500 for unknown errors, then pass that status into recordDuration(req,
statusCode, startTime); ensure you import HttpException from `@nestjs/common` and
leave the success/complete paths unchanged.

In `@apps/api/src/metrics/metrics.controller.ts`:
- Around line 6-16: The /metrics endpoint is publicly exposed via
MetricsController.getMetrics and leaks sensitive operational data; either block
it at the reverse proxy by adding a Caddy-level deny for /metrics (e.g., add a
respond /metrics 403 before the reverse_proxy) or secure it in the application
by adding an authentication guard to MetricsController (decorate the class or
getMetrics with `@UseGuards`(JwtAuthGuard) and require Bearer tokens) and ensure
MetricsService.getContentType/getMetrics remain accessible only to authenticated
requests; pick one approach and implement the corresponding config change or
decorator to prevent unauthenticated access.

In `@apps/api/src/metrics/metrics.service.ts`:
- Around line 157-165: The initial await this.collectGauges() in onModuleInit
can throw and abort startup; wrap that call in a try/catch so errors are caught
and logged (e.g., this.logger.warn(`Gauge collection failed: ${err.message}`))
similar to the setInterval handler, ensuring collectInterval is still created
and the method completes normally; update onModuleInit to call collectGauges()
inside a try/catch block (referencing onModuleInit, collectGauges,
collectInterval, and this.logger) so startup is resilient to transient DB
errors.

In `@apps/api/src/republish/republish.processor.ts`:
- Around line 11-26: Tests for RepublishProcessor fail because the DI test
module doesn't provide a MetricsService, causing reads of
republishRuns/republishEntriesProcessed to throw; update the RepublishProcessor
test setup to add a mock provider for MetricsService that supplies republishRuns
and republishEntriesProcessed objects with an inc method (jest.fn() or
equivalent) and any other metric methods used, and register this provider in the
Test.createTestingModule providers so the injected MetricsService used by
RepublishProcessor.process (and by RepublishService if needed) is defined during
tests.

In `@docker/MONITORING.md`:
- Line 83: Update the metric description for cipherbox_auth_logins_total to
reflect its actual semantics: change "Login attempts" to "Successful logins" (or
"Login events") in MONITORING.md; ensure this matches the implementation where
the counter authLogins in AuthController is incremented only after a successful
login so documentation and metric name/labels remain consistent.

---

Outside diff comments:
In `@apps/api/package.json`:
- Line 1: Branch coverage fell below threshold because new MetricsModule,
MetricsService, MetricsController, and HttpMetricsInterceptor lack unit tests
and there's a broken test in republish.processor.ts; add focused unit tests for
each new symbol (MetricsModule, MetricsService, MetricsController,
HttpMetricsInterceptor) exercising branches (success, error, edge cases and
config toggles) and update or fix the republish.processor.ts test to restore its
expected behavior (mock dependencies, assert side effects, and handle async
flows). Ensure you create mocks for injected dependencies, cover error paths and
conditional branches in methods of MetricsService and HttpMetricsInterceptor,
add controller tests for endpoints in MetricsController, and include the
module-level integration tests for MetricsModule so overall branch coverage
meets >=80%.

---

Duplicate comments:
In `@apps/api/src/auth/auth.controller.ts`:
- Around line 48-51: The AuthController constructor injects MetricsService
causing the same DI failure in tests; update the AuthController test setup to
include a mock provider for MetricsService (like the one added for
IpfsController tests) so the test module can resolve the dependency — add a
provider with provide: MetricsService and useValue: { increment: jest.fn(),
timing: jest.fn(), /* any used methods */ } or equivalent mock object in the
Test.createTestingModule providers array used for testing AuthController.

In `@apps/api/src/ipns/ipns.controller.ts`:
- Around line 36-39: The constructor injection for MetricsService in the
IpnsController is causing the same circular DI failure; update the constructor
parameter to use Nest's forwardRef injection (use `@Inject`(forwardRef(() =>
MetricsService)) on the metricsService parameter) and add/import Inject and
forwardRef from '@nestjs/common' so the IpnsController's constructor signature
mirrors the fix applied to IpfsController (keep the type as MetricsService).

---

Nitpick comments:
In `@apps/api/src/metrics/metrics.controller.ts`:
- Around line 12-16: The getMetrics handler currently uses `@Res`() which disables
NestJS response lifecycle; change the decorator to `@Res`({ passthrough: true })
on the getMetrics method, keep using res.set('Content-Type',
this.metricsService.getContentType()) to set headers, and return the metrics
value from metricsService.getMetrics() instead of calling res.send(metrics) so
global interceptors and exception filters remain active; update references in
the getMetrics method to use the returned value flow with the
metricsService.getMetrics and metricsService.getContentType calls.

In `@apps/api/src/metrics/metrics.service.ts`:
- Line 18: The public Registry field exposes internal state; change the readonly
registry: client.Registry declaration in the MetricsService class to private (or
protected) so external callers cannot mutate the registry directly, keeping the
public API via getMetrics() and getContentType(); update any internal references
inside MetricsService to use the now-private registry and adjust any tests or
consumers that referenced metricsService.registry to call the public methods
instead.
- Line 16: The class MetricsService declares lifecycle method onModuleDestroy
but only implements OnModuleInit; update the class declaration to also implement
OnModuleDestroy (i.e., change "implements OnModuleInit" to "implements
OnModuleInit, OnModuleDestroy") and ensure OnModuleDestroy is imported from
NestJS (where OnModuleInit is imported) so the TypeScript compiler enforces the
lifecycle hook signature for onModuleDestroy.
- Around line 56-86: The Gauges this.usersTotal, this.filesTotal,
this.storageBytesTotal, this.ipnsEntriesTotal, and this.republishScheduleTotal
are named with a `_total` suffix which is reserved for counters; rename the
metric name strings in each Gauge definition to drop `_total` (e.g., change
'cipherbox_users_total' → 'cipherbox_users', 'cipherbox_files_total' →
'cipherbox_files', 'cipherbox_storage_bytes_total' → 'cipherbox_storage_bytes',
'cipherbox_ipns_entries_total' → 'cipherbox_ipns_entries',
'cipherbox_republish_schedule_total' → 'cipherbox_republish_schedule') and
update any code that references those metric names (places that call set(),
labels(), inc(), etc. on these Gauge instances) to use the updated Gauge
objects/names so instrumentation and dashboards remain consistent.
- Around line 1-219: Metrics endpoint is currently unauthenticated and the new
controller requires coordination with the API client; update the
MetricsController to restrict access (e.g., enforce a static bearer token or
guard that checks an internal-only header/IP or apply NestJS AuthGuard) so that
calls to the controller's methods (the route that calls
MetricsService.getMetrics / getContentType) are protected, then run pnpm
api:generate and commit the regenerated API client files to keep the web app
types in sync.

In `@docker/docker-compose.staging.yml`:
- Around line 136-137: Add a Docker healthcheck to the api service (e.g., a
lightweight curl or tcp check that returns 0 when the API is responsive, with
sensible interval/retries) and then update alloy's depends_on entry for api to
use condition: service_healthy so alloy waits for the API to be ready (reference
the service names "api" and "alloy" and edit the docker-compose service
definitions for those services).

In `@docker/grafana/dashboards/cipherbox-staging.json`:
- Around line 818-834: This panel currently plots histogram_quantile p50/p95/p99
(refId A/B/C); add explicit P95 threshold lines by appending new targets with
unique refIds (e.g., D,E,F,...) where each target's expr is a scalar constant
(e.g., scalar(3), scalar(5), scalar(2), scalar(1), scalar(30)) and legendFormat
describes the threshold (auth_p95_threshold, upload_p95_threshold,
ipns_publish_p95_threshold, folder_create_p95_threshold,
multi_device_sync_p95_threshold); ensure these new targets are styled as
thin/dashed colored lines (and optionally excluded from main legend) so they
appear as constant threshold annotations on the same response-time panels for
easy visual comparison against the histogram_quantile curves.

- Fix HTTP metrics interceptor to derive status from exception (not response) - Fix batch IPNS publish metric to only count succeeded records - Make initial gauge collection resilient to DB unavailability - Block /metrics endpoint externally via Caddy - Add MetricsService mocks to all test files missing them - Rename "Login attempts" to "Successful logins" in docs and help text Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Infrastructure code like the metrics module (MetricsService, MetricsController, HttpMetricsInterceptor) is excluded from coverage, consistent with the existing health/ exclusion pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (3)

docker/Caddyfile (1)
11-11: LGTM — optionally harden the path matcher to cover the trailing-slash variant.

The directive is syntactically correct. The first non-matcher argument to respond can be a 3-digit status code, so /metrics is correctly parsed as the inline path matcher and 403 as the status, and Caddy's built-in directive order places respond before reverse_proxy, so the 403 short-circuits before any proxying regardless of file position.

One minor hardening note: Caddy path matching is exact, not prefix-based; wildcards at the end (/prefix/*) are required for a prefix match. This means respond /metrics 403 does not block GET /metrics/ (trailing slash). In practice this is low risk since prom-client doesn't register sub-paths, but /metrics* would be more defensive:
🛡️ Optional: cover trailing-slash variant
-	respond /metrics 403
+	respond /metrics* 403
Also confirm that the Alloy alloy-config.river scrape target points to http://api:3000/metrics (direct Docker service address) and not the public api-staging.cipherbox.cc hostname — otherwise scraping would itself receive the 403.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/Caddyfile` at line 11, Update the Caddy respond matcher to cover the
trailing-slash variant by changing the matcher from "respond /metrics 403" to a
prefix/wildcard form such as "respond /metrics* 403" (or "/metrics/*" if you
prefer exact prefix) so both /metrics and /metrics/ are blocked; also verify the
Alloy scrape target in alloy-config.river points to the internal Docker service
URL "http://api:3000/metrics" (not the public api-staging hostname) so scraping
won’t be blocked by this respond rule.
apps/api/src/metrics/metrics.service.ts (1)
16-16: Add OnModuleDestroy to the implements clause.

onModuleDestroy is implemented at line 170 but OnModuleDestroy is missing from the interface list. NestJS discovers lifecycle hooks by method name at runtime so cleanup still runs, but TypeScript won't validate the signature if it's ever accidentally renamed or mistyped.
♻️ Proposed fix
-import { Injectable, OnModuleInit, Logger } from '@nestjs/common';
+import { Injectable, OnModuleInit, OnModuleDestroy, Logger } from '@nestjs/common';

-export class MetricsService implements OnModuleInit {
+export class MetricsService implements OnModuleInit, OnModuleDestroy {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/metrics/metrics.service.ts` at line 16, The MetricsService class
currently implements OnModuleInit but also defines onModuleDestroy; update the
class declaration to implement both OnModuleInit and OnModuleDestroy (i.e., add
OnModuleDestroy to the implements clause on MetricsService) and ensure
OnModuleDestroy is imported from NestJS so TypeScript will validate the
onModuleDestroy signature (refer to the MetricsService class and the
onModuleDestroy method).
apps/api/src/ipns/__tests__/ipns.integration.spec.ts (1)
170-173: Extract the repeated mock into a shared factory to remove duplication.

mockMetricsService, mockMetricsService2, and mockMetricsService3 are structurally identical but declared three times. A single factory avoids the sequential suffix naming and makes future mock shape changes a one-line edit.
♻️ Proposed refactor

Add a helper before the first describe block:
+function makeMetricsServiceMock() {
+  return {
+    ipnsPublishes: { inc: jest.fn() },
+    ipnsResolves: { inc: jest.fn() },
+  };
+}
Then replace each inline mock declaration:
-      const mockMetricsService = {
-        ipnsPublishes: { inc: jest.fn() },
-        ipnsResolves: { inc: jest.fn() },
-      };
+      const mockMetricsService = makeMetricsServiceMock();
-      const mockMetricsService2 = {
-        ipnsPublishes: { inc: jest.fn() },
-        ipnsResolves: { inc: jest.fn() },
-      };
+      const mockMetricsService2 = makeMetricsServiceMock();
-      const mockMetricsService3 = {
-        ipnsPublishes: { inc: jest.fn() },
-        ipnsResolves: { inc: jest.fn() },
-      };
+      const mockMetricsService3 = makeMetricsServiceMock();
Also applies to: 331-334
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/ipns/__tests__/ipns.integration.spec.ts` around lines 170 - 173,
Extract the repeated mock into a single factory function (e.g.,
createMockMetricsService) declared before the first describe; have it return {
ipnsPublishes: { inc: jest.fn() }, ipnsResolves: { inc: jest.fn() } }. Replace
the three inline objects named mockMetricsService, mockMetricsService2, and
mockMetricsService3 by calling createMockMetricsService() wherever those mocks
are used (in the tests referencing the mock objects), so all mock shape changes
are centralized in the factory.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/metrics/http-metrics.interceptor.ts`:
- Around line 36-57: normalizeRoute is applied to both matched route templates
and raw paths causing static templates like "/ipns/publish" to be mangled and
`/metrics` scrapes still get recorded; fix by making normalizeRoute only perform
regex replacements when given a raw path (i.e., call it only for the fallback
req.path), and in recordDuration use req.route?.path directly when available (no
normalization), otherwise normalize req.path; additionally, add an early return
in recordDuration to skip recording when the resolved route equals "/metrics" so
Prometheus scrapes are not observed. Reference functions: recordDuration and
normalizeRoute.

In `@apps/api/src/metrics/metrics.service.ts`:
- Around line 1-222: Add unit tests to cover MetricsService and
HttpMetricsInterceptor to meet coverage thresholds: create
metrics.service.spec.ts that exercises MetricsService.onModuleInit (simulate
successful and failing initial collectGauges by mocking
userRepository/pinnedCidRepository/folderIpnsRepository/republishScheduleRepository),
onModuleDestroy (ensuring interval cleared), collectGauges (verify gauge resets
and label setting behavior using mocked query builders returning different
ipns/republish rows), and the accessors getMetrics/getContentType; also add
http-metrics.interceptor.spec.ts that tests HttpMetricsInterceptor.intercept for
normal request timing, error path, and exclusion of the /metrics route (mock
execution context, call handler, and validate histogram/counter interactions).
Ensure tests spy/mock the prom-client Registry/Gauge/Counter/Histogram methods
so assertions verify increments/labels and that setInterval is handled (use fake
timers where needed).
- Around line 159-165: The logger calls in the initial and interval gauge
collection handlers (inside collectGauges() .catch blocks and the
collectInterval setup) use err.message which is undefined for non-Error throws;
update both catch handlers to normalize the thrown value before logging (e.g.,
if (err instanceof Error) use err.message else use String(err) or
util.inspect(err)) and include that normalized text in this.logger.warn so you
preserve diagnostic context when collectGauges rejects with strings or plain
objects.

In `@docker/MONITORING.md`:
- Line 267: Update the Dashboard Panels text that currently reads "Login
attempts by method" to match the metric and line 83 description: change it to
"Successful logins by method, new vs returning users" and ensure it references
the `cipherbox_auth_logins_total` metric name; locate the Dashboard Panels
section where the phrase "Login attempts by method" appears and replace the
wording so it is consistent with the `cipherbox_auth_logins_total` description
on line 83.

---

Duplicate comments:
In `@docker/MONITORING.md`:
- Line 83: The table entry for the metric `cipherbox_auth_logins_total` must
state that it counts successful logins only—update the markdown row so the
description reads something like "Number of successful logins" (and keep the
labels `method`, `new_user`), and ensure the metric's help string in the
instrumentation code (the `cipherbox_auth_logins_total` registration) matches
this revised description exactly.

---

Nitpick comments:
In `@apps/api/src/ipns/__tests__/ipns.integration.spec.ts`:
- Around line 170-173: Extract the repeated mock into a single factory function
(e.g., createMockMetricsService) declared before the first describe; have it
return { ipnsPublishes: { inc: jest.fn() }, ipnsResolves: { inc: jest.fn() } }.
Replace the three inline objects named mockMetricsService, mockMetricsService2,
and mockMetricsService3 by calling createMockMetricsService() wherever those
mocks are used (in the tests referencing the mock objects), so all mock shape
changes are centralized in the factory.

In `@apps/api/src/metrics/metrics.service.ts`:
- Line 16: The MetricsService class currently implements OnModuleInit but also
defines onModuleDestroy; update the class declaration to implement both
OnModuleInit and OnModuleDestroy (i.e., add OnModuleDestroy to the implements
clause on MetricsService) and ensure OnModuleDestroy is imported from NestJS so
TypeScript will validate the onModuleDestroy signature (refer to the
MetricsService class and the onModuleDestroy method).

In `@docker/Caddyfile`:
- Line 11: Update the Caddy respond matcher to cover the trailing-slash variant
by changing the matcher from "respond /metrics 403" to a prefix/wildcard form
such as "respond /metrics* 403" (or "/metrics/*" if you prefer exact prefix) so
both /metrics and /metrics/ are blocked; also verify the Alloy scrape target in
alloy-config.river points to the internal Docker service URL
"http://api:3000/metrics" (not the public api-staging hostname) so scraping
won’t be blocked by this respond rule.

- Fix route normalization: use req.route.path directly (already parameterised), only normalize raw req.path as fallback - Exclude /metrics from histogram recording, not just regex chain - Use resilient error logging (instanceof Error check) in gauge collection - Fix remaining "Login attempts" wording in dashboard panels section Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

FSM1 changed the title ~~Add Prometheus metrics and Grafana dashboard for staging monitoring~~ feat(api): Add Prometheus metrics and Grafana dashboard for staging monitoring Feb 18, 2026

fix(ci): move VITE_WEB3AUTH_CLIENT_ID from secrets to vars

812db8d

Client IDs are not secrets — they're embedded in the built JS bundle. Using vars instead of secrets for this value across deploy-staging and e2e workflows. https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w

coderabbitai Bot reviewed Feb 18, 2026

View reviewed changes

FSM1 and others added 2 commits February 18, 2026 12:28

coderabbitai Bot reviewed Feb 18, 2026

View reviewed changes

Comment thread apps/api/src/metrics/http-metrics.interceptor.ts

Comment thread apps/api/src/metrics/metrics.service.ts

Comment thread apps/api/src/metrics/metrics.service.ts Outdated

Comment thread docker/MONITORING.md Outdated

FSM1 merged commit 835d6c3 into main Feb 18, 2026
24 checks passed

cipherbox-release-bot Bot mentioned this pull request Feb 18, 2026

chore: release 0.10.0 [legacy] #142

Merged

FSM1 deleted the claude/grafana-staging-dashboards-HuV2C branch February 21, 2026 06:18

This was referenced Feb 25, 2026

feat(web): GDPR account deletion with IPFS unpin #202

Merged

fix(api,web): MFA REQUIRED_SHARE auth flow + E2E test coverage #213

Merged

This was referenced Mar 7, 2026

feat: add performance instrumentation for IPFS/IPNS and TEE operations #281

Merged

feat: IPNS resolution improvement with Someguy sidecar and latency metrics #284

Merged

cipherbox-release-bot Bot mentioned this pull request Mar 23, 2026

chore: release main #331

Closed

cipherbox-release-bot Bot mentioned this pull request Apr 1, 2026

chore: release main #427

Merged

This was referenced Jun 11, 2026

docs: protocol and privacy hardening exploration, plus v1.1 audit #481

Merged

feat(api): guard unpin with ownership check and reference-counted quota decrement #485

Merged

coderabbitai Bot mentioned this pull request Jun 25, 2026

chore(api): emit explicit zero for IPNS-entries gauge on empty table #561

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(api): Add Prometheus metrics and Grafana dashboard for staging monitoring#141

feat(api): Add Prometheus metrics and Grafana dashboard for staging monitoring#141
FSM1 merged 5 commits into
mainfrom
claude/grafana-staging-dashboards-HuV2C

FSM1 commented Feb 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Feb 18, 2026 •

edited

Loading

Rate limit exceeded

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FSM1 commented Feb 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

Implementation Details

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FSM1 commented Feb 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 18, 2026 •

edited

Loading