feat(api): Add Prometheus metrics and Grafana dashboard for staging monitoring#141
Conversation
Add prom-client to the API with a MetricsModule that exposes /metrics for Grafana Alloy to scrape. Metrics include DB-polled gauges (users, files, storage, IPNS entries, republish schedule status) and event-driven counters (uploads, downloads, IPNS publishes/resolves, republish results, auth logins) plus an HTTP request duration histogram. - New MetricsModule (global) with MetricsService, MetricsController - HttpMetricsInterceptor for request duration tracking - Instrumented IpfsController, IpnsController, RepublishProcessor, AuthController - Updated Alloy config to scrape /metrics and remote-write to Grafana Cloud Mimir - Added Prometheus env vars to docker-compose.staging.yml and deploy workflow - Pre-built Grafana dashboard JSON with 20+ panels across 6 sections - Updated MONITORING.md with full metrics reference and setup instructions https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. WalkthroughAdds Prometheus-based observability: new MetricsService, HttpMetricsInterceptor, /metrics endpoint, instrumentation across controllers/processors and tests, Prometheus scrape/remote_write and Grafana dashboard, plus CI/docker staging env updates and Caddy rule to block /metrics. (≤50 words) Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Caddy
participant API
participant MetricsService
participant Prometheus
participant Grafana
Client->>Caddy: HTTP request (/api/...)
Caddy->>API: proxy request
API->>API: HttpMetricsInterceptor records start time
API->>API: Controller executes business logic
API->>MetricsService: increment counters / observe histogram
API-->>Client: response
Prometheus->>API: scrape GET /metrics
API->>MetricsService: MetricsService.getMetrics()
API-->>Prometheus: metrics payload (text/plain)
Grafana->>Prometheus: query metrics (PromQL) / fetch dashboards
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Client IDs are not secrets — they're embedded in the built JS bundle. Using vars instead of secrets for this value across deploy-staging and e2e workflows. https://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/api/package.json (1)
1-1:⚠️ Potential issue | 🟡 MinorBranch coverage threshold failing (76.18% < 80%) must be resolved before merge.
The new
MetricsModule,MetricsService,MetricsController, andHttpMetricsInterceptorare added without accompanying tests, which dropped branch coverage below the required 80%. The processor test is also broken (seerepublish.processor.ts). Unit tests for the metrics subsystem need to be added before this can merge.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/package.json` at line 1, Branch coverage fell below threshold because new MetricsModule, MetricsService, MetricsController, and HttpMetricsInterceptor lack unit tests and there's a broken test in republish.processor.ts; add focused unit tests for each new symbol (MetricsModule, MetricsService, MetricsController, HttpMetricsInterceptor) exercising branches (success, error, edge cases and config toggles) and update or fix the republish.processor.ts test to restore its expected behavior (mock dependencies, assert side effects, and handle async flows). Ensure you create mocks for injected dependencies, cover error paths and conditional branches in methods of MetricsService and HttpMetricsInterceptor, add controller tests for endpoints in MetricsController, and include the module-level integration tests for MetricsModule so overall branch coverage meets >=80%.
🧹 Nitpick comments (8)
apps/api/src/metrics/metrics.controller.ts (1)
12-16: Consider@Res({ passthrough: true })to keep NestJS response lifecycle active.With the default
@Res()(passthrough off), NestJS skips global interceptors and response transformations for this route. While the metrics endpoint doesn't need them, it also means an unhandled exception fromgetMetrics()may not be cleaned up by the standard exception filter. Usingpassthrough: trueand returning the value is the idiomatic NestJS approach when only header mutation is needed.♻️ Optional refactor
-async getMetrics(`@Res`() res: Response): Promise<void> { - const metrics = await this.metricsService.getMetrics(); - res.set('Content-Type', this.metricsService.getContentType()); - res.send(metrics); -} +async getMetrics(`@Res`({ passthrough: true }) res: Response): Promise<string> { + res.set('Content-Type', this.metricsService.getContentType()); + return this.metricsService.getMetrics(); +}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.controller.ts` around lines 12 - 16, The getMetrics handler currently uses `@Res`() which disables NestJS response lifecycle; change the decorator to `@Res`({ passthrough: true }) on the getMetrics method, keep using res.set('Content-Type', this.metricsService.getContentType()) to set headers, and return the metrics value from metricsService.getMetrics() instead of calling res.send(metrics) so global interceptors and exception filters remain active; update references in the getMetrics method to use the returned value flow with the metricsService.getMetrics and metricsService.getContentType calls.docker/docker-compose.staging.yml (1)
136-137: Consider addingcondition: service_healthytoalloy'sapidependency.The
apiservice has no healthcheck, sodepends_on: - apionly waits for the container to start, not for the API to be ready. The first few scrapes will fail silently; Prometheus is tolerant of this, but making the dependency explicit with a healthcheck on the API container would give a cleaner startup ordering.🔧 Optional improvement
Add a healthcheck to the
apiservice:api: image: ... + healthcheck: + test: ['CMD-SHELL', 'wget -qO- http://localhost:3000/health || exit 1'] + interval: 10s + timeout: 5s + retries: 5 + start_period: 20sThen tighten
alloy's dependency:- depends_on: - - api + depends_on: + api: + condition: service_healthy🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/docker-compose.staging.yml` around lines 136 - 137, Add a Docker healthcheck to the api service (e.g., a lightweight curl or tcp check that returns 0 when the API is responsive, with sensible interval/retries) and then update alloy's depends_on entry for api to use condition: service_healthy so alloy waits for the API to be ready (reference the service names "api" and "alloy" and edit the docker-compose service definitions for those services).docker/grafana/dashboards/cipherbox-staging.json (1)
818-834: Consider adding P95 threshold annotations for key routes.The project has documented performance targets (P95): auth flow <3s, file upload/download <5s, IPNS publish <2s. Adding constant threshold lines to the response time panels would make it easy to spot regressions at a glance. This can be done as a follow-up.
Based on learnings: "Performance benchmark targets (P95): auth flow <3s, file upload/download (<100MB) <5s, IPNS publish <2s, folder create <1s, multi-device sync latency <30s."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/grafana/dashboards/cipherbox-staging.json` around lines 818 - 834, This panel currently plots histogram_quantile p50/p95/p99 (refId A/B/C); add explicit P95 threshold lines by appending new targets with unique refIds (e.g., D,E,F,...) where each target's expr is a scalar constant (e.g., scalar(3), scalar(5), scalar(2), scalar(1), scalar(30)) and legendFormat describes the threshold (auth_p95_threshold, upload_p95_threshold, ipns_publish_p95_threshold, folder_create_p95_threshold, multi_device_sync_p95_threshold); ensure these new targets are styled as thin/dashed colored lines (and optionally excluded from main legend) so they appear as constant threshold annotations on the same response-time panels for easy visual comparison against the histogram_quantile curves.apps/api/src/metrics/metrics.service.ts (5)
18-18:registryispublicbut the class already exposesgetMetrics()andgetContentType()as its intended API surfaceExposing the raw
Registryinstance allows any consumer to register arbitrary metrics into this registry, reset it, or callregistry.clear(). Making itprivate(orprotected) enforces encapsulation without breaking any existing usage, since the MetricsController only callsgetMetrics().♻️ Proposed fix
- readonly registry: client.Registry; + private readonly registry: client.Registry;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.service.ts` at line 18, The public Registry field exposes internal state; change the readonly registry: client.Registry declaration in the MetricsService class to private (or protected) so external callers cannot mutate the registry directly, keeping the public API via getMetrics() and getContentType(); update any internal references inside MetricsService to use the now-private registry and adjust any tests or consumers that referenced metricsService.registry to call the public methods instead.
16-16:OnModuleDestroyis missing from theimplementsclause
onModuleDestroy()exists and is correct, but the class only declaresimplements OnModuleInit. NestJS discovers lifecycle hooks by duck-typing so the interval will be cleared, but omitting the interface declaration removes the TypeScript compiler check (e.g., typo in method name or wrong return type would go undetected).♻️ Proposed fix
-import { Injectable, OnModuleInit, Logger } from '@nestjs/common'; +import { Injectable, OnModuleInit, OnModuleDestroy, Logger } from '@nestjs/common'; ... -export class MetricsService implements OnModuleInit { +export class MetricsService implements OnModuleInit, OnModuleDestroy {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.service.ts` at line 16, The class MetricsService declares lifecycle method onModuleDestroy but only implements OnModuleInit; update the class declaration to also implement OnModuleDestroy (i.e., change "implements OnModuleInit" to "implements OnModuleInit, OnModuleDestroy") and ensure OnModuleDestroy is imported from NestJS (where OnModuleInit is imported) so the TypeScript compiler enforces the lifecycle hook signature for onModuleDestroy.
56-86: Gauge metrics should not carry the_totalsuffix — that suffix is reserved for counters by Prometheus convention
cipherbox_users_total,cipherbox_files_total,cipherbox_storage_bytes_total,cipherbox_ipns_entries_total, andcipherbox_republish_schedule_totalare allclient.Gaugeinstances. The_totalsuffix signals a monotonically increasing counter to Prometheus tooling and dashboards, which conflicts with the mutable/resettable semantics of these gauge metrics. Consider renaming them to drop_total(e.g.,cipherbox_users,cipherbox_files,cipherbox_storage_bytes,cipherbox_ipns_entries,cipherbox_republish_schedule).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.service.ts` around lines 56 - 86, The Gauges this.usersTotal, this.filesTotal, this.storageBytesTotal, this.ipnsEntriesTotal, and this.republishScheduleTotal are named with a `_total` suffix which is reserved for counters; rename the metric name strings in each Gauge definition to drop `_total` (e.g., change 'cipherbox_users_total' → 'cipherbox_users', 'cipherbox_files_total' → 'cipherbox_files', 'cipherbox_storage_bytes_total' → 'cipherbox_storage_bytes', 'cipherbox_ipns_entries_total' → 'cipherbox_ipns_entries', 'cipherbox_republish_schedule_total' → 'cipherbox_republish_schedule') and update any code that references those metric names (places that call set(), labels(), inc(), etc. on these Gauge instances) to use the updated Gauge objects/names so instrumentation and dashboards remain consistent.
1-219: Unauthenticated/metricsendpoint exposes operational intelligence; verify API client regenerationTwo operational concerns for this PR:
Security posture: The
/metricsendpoint is intentionally unauthenticated. It exposes user/file/storage counts, authentication patterns (method,new_userlabels), IPNS record-type distributions, and HTTP route latency histograms. For a staging environment reachable from outside the private network this constitutes a meaningful information-disclosure surface. Consider at minimum a static bearer token check inMetricsController, IP-allow-listing at the load balancer, or binding the endpoint to a separate internal-only port.API client regeneration: The PR introduces a new
MetricsControllerunderapps/api. As per the coding guidelines, "After modifying API endpoints, DTOs, or controllers inapps/api, regenerate the API client by runningpnpm api:generateto keep the web app type-safe and in sync, and commit the regenerated client files." Even though the endpoint is decorated with@ApiExcludeEndpoint(), the regeneration step should be run and any changed (or unchanged) client files committed to confirm the spec is stable. As per coding guidelines: "After modifying API endpoints, DTOs, or controllers inapps/api, regenerate the API client by runningpnpm api:generate."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.service.ts` around lines 1 - 219, Metrics endpoint is currently unauthenticated and the new controller requires coordination with the API client; update the MetricsController to restrict access (e.g., enforce a static bearer token or guard that checks an internal-only header/IP or apply NestJS AuthGuard) so that calls to the controller's methods (the route that calls MetricsService.getMetrics / getContentType) are protected, then run pnpm api:generate and commit the regenerated API client files to keep the web app types in sync.
157-165: Use prom-client's built-incollect()function for gauge population instead of manualsetIntervalThe prom-client documentation recommends using
collect()as the idiomatic pattern for gauge updates. Thecollect()function is invoked on each scrape (when/metricsis called), ensuring values are always fresh at scrape time and eliminating the startup timing gap. This removes the need for manualsetInterval/onModuleDestroylifecycle management.For async operations, pass an async function (not an arrow function—
thisbinding matters) directly in the gauge configuration:this.usersTotal = new client.Gauge({ name: 'cipherbox_users_total', help: 'Total registered users', registers: [this.registry], collect: async function () { this.set(await this.userRepository.count()); }, });Note: Since
collect()runs on the scrape path, consider whether querying the database on every scrape is appropriate. If the current 30-second polling interval is important for performance, you may instead usecollect()with cached results updated by a background timer.Apply to all five gauges (lines 56–86) and remove
onModuleInit/onModuleDestroy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/api/src/auth/auth.controller.ts`:
- Around line 70-73: The authLogins counter is only incremented on successful
authentication; wrap the call to this.authService.login(...) in a try/catch and
update metric emission to include a result label so both successes and failures
are recorded: on success call this.metricsService.authLogins.inc({ method:
'web3auth', result: 'success', new_user: String(result.isNewUser) }); on error
catch UnauthorizedException (or any thrown error), call
this.metricsService.authLogins.inc({ method: 'web3auth', result: 'failure',
new_user: 'false' }) (or appropriate default), then rethrow the error so
behavior is unchanged; update references to metricsService.authLogins.inc and
authService.login accordingly.
In `@apps/api/src/ipns/ipns.controller.ts`:
- Around line 109-113: The batch publish metric currently counts both successes
and failures; update the increment in the IPNS controller so
metricsService.ipnsPublishes.inc({ type: 'batch' }, ...) uses only
result.totalSucceeded (not totalSucceeded + totalFailed) to match single publish
behavior (see ipnsService.publishBatch and the single publish path that only
increments on success), or alternatively add a label like {type: 'batch',
outcome: 'succeeded'|'failed'} and call inc separately for succeeded and failed
counts to make success/failure explicit.
In `@apps/api/src/metrics/http-metrics.interceptor.ts`:
- Around line 21-26: The error branch in the interceptor records statusCode from
the response too early; change the error handler in http-metrics.interceptor.ts
to derive status by inspecting the thrown error (check instanceof HttpException
or presence of getStatus()) and call error.getStatus() when available, falling
back to 500 for unknown errors, then pass that status into recordDuration(req,
statusCode, startTime); ensure you import HttpException from `@nestjs/common` and
leave the success/complete paths unchanged.
In `@apps/api/src/metrics/metrics.controller.ts`:
- Around line 6-16: The /metrics endpoint is publicly exposed via
MetricsController.getMetrics and leaks sensitive operational data; either block
it at the reverse proxy by adding a Caddy-level deny for /metrics (e.g., add a
respond /metrics 403 before the reverse_proxy) or secure it in the application
by adding an authentication guard to MetricsController (decorate the class or
getMetrics with `@UseGuards`(JwtAuthGuard) and require Bearer tokens) and ensure
MetricsService.getContentType/getMetrics remain accessible only to authenticated
requests; pick one approach and implement the corresponding config change or
decorator to prevent unauthenticated access.
In `@apps/api/src/metrics/metrics.service.ts`:
- Around line 157-165: The initial await this.collectGauges() in onModuleInit
can throw and abort startup; wrap that call in a try/catch so errors are caught
and logged (e.g., this.logger.warn(`Gauge collection failed: ${err.message}`))
similar to the setInterval handler, ensuring collectInterval is still created
and the method completes normally; update onModuleInit to call collectGauges()
inside a try/catch block (referencing onModuleInit, collectGauges,
collectInterval, and this.logger) so startup is resilient to transient DB
errors.
In `@apps/api/src/republish/republish.processor.ts`:
- Around line 11-26: Tests for RepublishProcessor fail because the DI test
module doesn't provide a MetricsService, causing reads of
republishRuns/republishEntriesProcessed to throw; update the RepublishProcessor
test setup to add a mock provider for MetricsService that supplies republishRuns
and republishEntriesProcessed objects with an inc method (jest.fn() or
equivalent) and any other metric methods used, and register this provider in the
Test.createTestingModule providers so the injected MetricsService used by
RepublishProcessor.process (and by RepublishService if needed) is defined during
tests.
In `@docker/MONITORING.md`:
- Line 83: Update the metric description for cipherbox_auth_logins_total to
reflect its actual semantics: change "Login attempts" to "Successful logins" (or
"Login events") in MONITORING.md; ensure this matches the implementation where
the counter authLogins in AuthController is incremented only after a successful
login so documentation and metric name/labels remain consistent.
---
Outside diff comments:
In `@apps/api/package.json`:
- Line 1: Branch coverage fell below threshold because new MetricsModule,
MetricsService, MetricsController, and HttpMetricsInterceptor lack unit tests
and there's a broken test in republish.processor.ts; add focused unit tests for
each new symbol (MetricsModule, MetricsService, MetricsController,
HttpMetricsInterceptor) exercising branches (success, error, edge cases and
config toggles) and update or fix the republish.processor.ts test to restore its
expected behavior (mock dependencies, assert side effects, and handle async
flows). Ensure you create mocks for injected dependencies, cover error paths and
conditional branches in methods of MetricsService and HttpMetricsInterceptor,
add controller tests for endpoints in MetricsController, and include the
module-level integration tests for MetricsModule so overall branch coverage
meets >=80%.
---
Duplicate comments:
In `@apps/api/src/auth/auth.controller.ts`:
- Around line 48-51: The AuthController constructor injects MetricsService
causing the same DI failure in tests; update the AuthController test setup to
include a mock provider for MetricsService (like the one added for
IpfsController tests) so the test module can resolve the dependency — add a
provider with provide: MetricsService and useValue: { increment: jest.fn(),
timing: jest.fn(), /* any used methods */ } or equivalent mock object in the
Test.createTestingModule providers array used for testing AuthController.
In `@apps/api/src/ipns/ipns.controller.ts`:
- Around line 36-39: The constructor injection for MetricsService in the
IpnsController is causing the same circular DI failure; update the constructor
parameter to use Nest's forwardRef injection (use `@Inject`(forwardRef(() =>
MetricsService)) on the metricsService parameter) and add/import Inject and
forwardRef from '@nestjs/common' so the IpnsController's constructor signature
mirrors the fix applied to IpfsController (keep the type as MetricsService).
---
Nitpick comments:
In `@apps/api/src/metrics/metrics.controller.ts`:
- Around line 12-16: The getMetrics handler currently uses `@Res`() which disables
NestJS response lifecycle; change the decorator to `@Res`({ passthrough: true })
on the getMetrics method, keep using res.set('Content-Type',
this.metricsService.getContentType()) to set headers, and return the metrics
value from metricsService.getMetrics() instead of calling res.send(metrics) so
global interceptors and exception filters remain active; update references in
the getMetrics method to use the returned value flow with the
metricsService.getMetrics and metricsService.getContentType calls.
In `@apps/api/src/metrics/metrics.service.ts`:
- Line 18: The public Registry field exposes internal state; change the readonly
registry: client.Registry declaration in the MetricsService class to private (or
protected) so external callers cannot mutate the registry directly, keeping the
public API via getMetrics() and getContentType(); update any internal references
inside MetricsService to use the now-private registry and adjust any tests or
consumers that referenced metricsService.registry to call the public methods
instead.
- Line 16: The class MetricsService declares lifecycle method onModuleDestroy
but only implements OnModuleInit; update the class declaration to also implement
OnModuleDestroy (i.e., change "implements OnModuleInit" to "implements
OnModuleInit, OnModuleDestroy") and ensure OnModuleDestroy is imported from
NestJS (where OnModuleInit is imported) so the TypeScript compiler enforces the
lifecycle hook signature for onModuleDestroy.
- Around line 56-86: The Gauges this.usersTotal, this.filesTotal,
this.storageBytesTotal, this.ipnsEntriesTotal, and this.republishScheduleTotal
are named with a `_total` suffix which is reserved for counters; rename the
metric name strings in each Gauge definition to drop `_total` (e.g., change
'cipherbox_users_total' → 'cipherbox_users', 'cipherbox_files_total' →
'cipherbox_files', 'cipherbox_storage_bytes_total' → 'cipherbox_storage_bytes',
'cipherbox_ipns_entries_total' → 'cipherbox_ipns_entries',
'cipherbox_republish_schedule_total' → 'cipherbox_republish_schedule') and
update any code that references those metric names (places that call set(),
labels(), inc(), etc. on these Gauge instances) to use the updated Gauge
objects/names so instrumentation and dashboards remain consistent.
- Around line 1-219: Metrics endpoint is currently unauthenticated and the new
controller requires coordination with the API client; update the
MetricsController to restrict access (e.g., enforce a static bearer token or
guard that checks an internal-only header/IP or apply NestJS AuthGuard) so that
calls to the controller's methods (the route that calls
MetricsService.getMetrics / getContentType) are protected, then run pnpm
api:generate and commit the regenerated API client files to keep the web app
types in sync.
In `@docker/docker-compose.staging.yml`:
- Around line 136-137: Add a Docker healthcheck to the api service (e.g., a
lightweight curl or tcp check that returns 0 when the API is responsive, with
sensible interval/retries) and then update alloy's depends_on entry for api to
use condition: service_healthy so alloy waits for the API to be ready (reference
the service names "api" and "alloy" and edit the docker-compose service
definitions for those services).
In `@docker/grafana/dashboards/cipherbox-staging.json`:
- Around line 818-834: This panel currently plots histogram_quantile p50/p95/p99
(refId A/B/C); add explicit P95 threshold lines by appending new targets with
unique refIds (e.g., D,E,F,...) where each target's expr is a scalar constant
(e.g., scalar(3), scalar(5), scalar(2), scalar(1), scalar(30)) and legendFormat
describes the threshold (auth_p95_threshold, upload_p95_threshold,
ipns_publish_p95_threshold, folder_create_p95_threshold,
multi_device_sync_p95_threshold); ensure these new targets are styled as
thin/dashed colored lines (and optionally excluded from main legend) so they
appear as constant threshold annotations on the same response-time panels for
easy visual comparison against the histogram_quantile curves.
- Fix HTTP metrics interceptor to derive status from exception (not response) - Fix batch IPNS publish metric to only count succeeded records - Make initial gauge collection resilient to DB unavailability - Block /metrics endpoint externally via Caddy - Add MetricsService mocks to all test files missing them - Rename "Login attempts" to "Successful logins" in docs and help text Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Infrastructure code like the metrics module (MetricsService, MetricsController, HttpMetricsInterceptor) is excluded from coverage, consistent with the existing health/ exclusion pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
docker/Caddyfile (1)
11-11: LGTM — optionally harden the path matcher to cover the trailing-slash variant.The directive is syntactically correct. The first non-matcher argument to
respondcan be a 3-digit status code, so/metricsis correctly parsed as the inline path matcher and403as the status, and Caddy's built-in directive order placesrespondbeforereverse_proxy, so the 403 short-circuits before any proxying regardless of file position.One minor hardening note: Caddy path matching is exact, not prefix-based; wildcards at the end (
/prefix/*) are required for a prefix match. This meansrespond /metrics 403does not blockGET /metrics/(trailing slash). In practice this is low risk since prom-client doesn't register sub-paths, but/metrics*would be more defensive:🛡️ Optional: cover trailing-slash variant
- respond /metrics 403 + respond /metrics* 403Also confirm that the Alloy
alloy-config.riverscrape target points tohttp://api:3000/metrics(direct Docker service address) and not the publicapi-staging.cipherbox.cchostname — otherwise scraping would itself receive the 403.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/Caddyfile` at line 11, Update the Caddy respond matcher to cover the trailing-slash variant by changing the matcher from "respond /metrics 403" to a prefix/wildcard form such as "respond /metrics* 403" (or "/metrics/*" if you prefer exact prefix) so both /metrics and /metrics/ are blocked; also verify the Alloy scrape target in alloy-config.river points to the internal Docker service URL "http://api:3000/metrics" (not the public api-staging hostname) so scraping won’t be blocked by this respond rule.apps/api/src/metrics/metrics.service.ts (1)
16-16: AddOnModuleDestroyto theimplementsclause.
onModuleDestroyis implemented at line 170 butOnModuleDestroyis missing from the interface list. NestJS discovers lifecycle hooks by method name at runtime so cleanup still runs, but TypeScript won't validate the signature if it's ever accidentally renamed or mistyped.♻️ Proposed fix
-import { Injectable, OnModuleInit, Logger } from '@nestjs/common'; +import { Injectable, OnModuleInit, OnModuleDestroy, Logger } from '@nestjs/common'; -export class MetricsService implements OnModuleInit { +export class MetricsService implements OnModuleInit, OnModuleDestroy {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/metrics/metrics.service.ts` at line 16, The MetricsService class currently implements OnModuleInit but also defines onModuleDestroy; update the class declaration to implement both OnModuleInit and OnModuleDestroy (i.e., add OnModuleDestroy to the implements clause on MetricsService) and ensure OnModuleDestroy is imported from NestJS so TypeScript will validate the onModuleDestroy signature (refer to the MetricsService class and the onModuleDestroy method).apps/api/src/ipns/__tests__/ipns.integration.spec.ts (1)
170-173: Extract the repeated mock into a shared factory to remove duplication.
mockMetricsService,mockMetricsService2, andmockMetricsService3are structurally identical but declared three times. A single factory avoids the sequential suffix naming and makes future mock shape changes a one-line edit.♻️ Proposed refactor
Add a helper before the first
describeblock:+function makeMetricsServiceMock() { + return { + ipnsPublishes: { inc: jest.fn() }, + ipnsResolves: { inc: jest.fn() }, + }; +}Then replace each inline mock declaration:
- const mockMetricsService = { - ipnsPublishes: { inc: jest.fn() }, - ipnsResolves: { inc: jest.fn() }, - }; + const mockMetricsService = makeMetricsServiceMock();- const mockMetricsService2 = { - ipnsPublishes: { inc: jest.fn() }, - ipnsResolves: { inc: jest.fn() }, - }; + const mockMetricsService2 = makeMetricsServiceMock();- const mockMetricsService3 = { - ipnsPublishes: { inc: jest.fn() }, - ipnsResolves: { inc: jest.fn() }, - }; + const mockMetricsService3 = makeMetricsServiceMock();Also applies to: 331-334
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/api/src/ipns/__tests__/ipns.integration.spec.ts` around lines 170 - 173, Extract the repeated mock into a single factory function (e.g., createMockMetricsService) declared before the first describe; have it return { ipnsPublishes: { inc: jest.fn() }, ipnsResolves: { inc: jest.fn() } }. Replace the three inline objects named mockMetricsService, mockMetricsService2, and mockMetricsService3 by calling createMockMetricsService() wherever those mocks are used (in the tests referencing the mock objects), so all mock shape changes are centralized in the factory.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/api/src/metrics/http-metrics.interceptor.ts`:
- Around line 36-57: normalizeRoute is applied to both matched route templates
and raw paths causing static templates like "/ipns/publish" to be mangled and
`/metrics` scrapes still get recorded; fix by making normalizeRoute only perform
regex replacements when given a raw path (i.e., call it only for the fallback
req.path), and in recordDuration use req.route?.path directly when available (no
normalization), otherwise normalize req.path; additionally, add an early return
in recordDuration to skip recording when the resolved route equals "/metrics" so
Prometheus scrapes are not observed. Reference functions: recordDuration and
normalizeRoute.
In `@apps/api/src/metrics/metrics.service.ts`:
- Around line 1-222: Add unit tests to cover MetricsService and
HttpMetricsInterceptor to meet coverage thresholds: create
metrics.service.spec.ts that exercises MetricsService.onModuleInit (simulate
successful and failing initial collectGauges by mocking
userRepository/pinnedCidRepository/folderIpnsRepository/republishScheduleRepository),
onModuleDestroy (ensuring interval cleared), collectGauges (verify gauge resets
and label setting behavior using mocked query builders returning different
ipns/republish rows), and the accessors getMetrics/getContentType; also add
http-metrics.interceptor.spec.ts that tests HttpMetricsInterceptor.intercept for
normal request timing, error path, and exclusion of the /metrics route (mock
execution context, call handler, and validate histogram/counter interactions).
Ensure tests spy/mock the prom-client Registry/Gauge/Counter/Histogram methods
so assertions verify increments/labels and that setInterval is handled (use fake
timers where needed).
- Around line 159-165: The logger calls in the initial and interval gauge
collection handlers (inside collectGauges() .catch blocks and the
collectInterval setup) use err.message which is undefined for non-Error throws;
update both catch handlers to normalize the thrown value before logging (e.g.,
if (err instanceof Error) use err.message else use String(err) or
util.inspect(err)) and include that normalized text in this.logger.warn so you
preserve diagnostic context when collectGauges rejects with strings or plain
objects.
In `@docker/MONITORING.md`:
- Line 267: Update the Dashboard Panels text that currently reads "Login
attempts by method" to match the metric and line 83 description: change it to
"Successful logins by method, new vs returning users" and ensure it references
the `cipherbox_auth_logins_total` metric name; locate the Dashboard Panels
section where the phrase "Login attempts by method" appears and replace the
wording so it is consistent with the `cipherbox_auth_logins_total` description
on line 83.
---
Duplicate comments:
In `@docker/MONITORING.md`:
- Line 83: The table entry for the metric `cipherbox_auth_logins_total` must
state that it counts successful logins only—update the markdown row so the
description reads something like "Number of successful logins" (and keep the
labels `method`, `new_user`), and ensure the metric's help string in the
instrumentation code (the `cipherbox_auth_logins_total` registration) matches
this revised description exactly.
---
Nitpick comments:
In `@apps/api/src/ipns/__tests__/ipns.integration.spec.ts`:
- Around line 170-173: Extract the repeated mock into a single factory function
(e.g., createMockMetricsService) declared before the first describe; have it
return { ipnsPublishes: { inc: jest.fn() }, ipnsResolves: { inc: jest.fn() } }.
Replace the three inline objects named mockMetricsService, mockMetricsService2,
and mockMetricsService3 by calling createMockMetricsService() wherever those
mocks are used (in the tests referencing the mock objects), so all mock shape
changes are centralized in the factory.
In `@apps/api/src/metrics/metrics.service.ts`:
- Line 16: The MetricsService class currently implements OnModuleInit but also
defines onModuleDestroy; update the class declaration to implement both
OnModuleInit and OnModuleDestroy (i.e., add OnModuleDestroy to the implements
clause on MetricsService) and ensure OnModuleDestroy is imported from NestJS so
TypeScript will validate the onModuleDestroy signature (refer to the
MetricsService class and the onModuleDestroy method).
In `@docker/Caddyfile`:
- Line 11: Update the Caddy respond matcher to cover the trailing-slash variant
by changing the matcher from "respond /metrics 403" to a prefix/wildcard form
such as "respond /metrics* 403" (or "/metrics/*" if you prefer exact prefix) so
both /metrics and /metrics/ are blocked; also verify the Alloy scrape target in
alloy-config.river points to the internal Docker service URL
"http://api:3000/metrics" (not the public api-staging hostname) so scraping
won’t be blocked by this respond rule.
- Fix route normalization: use req.route.path directly (already parameterised), only normalize raw req.path as fallback - Exclude /metrics from histogram recording, not just regex chain - Use resilient error logging (instanceof Error check) in gauge collection - Fix remaining "Login attempts" wording in dashboard panels section Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR adds comprehensive Prometheus metrics collection and a pre-built Grafana dashboard to monitor the CipherBox staging environment. Metrics are scraped from the API's
/metricsendpoint and shipped to Grafana Cloud Mimir, complementing the existing Loki log aggregation.Key Changes
Metrics Service (
apps/api/src/metrics/)MetricsService: Central registry managing Prometheus metrics (gauges, counters, histograms)MetricsController: Exposes/metricsendpoint (excluded from Swagger, unauthenticated)HttpMetricsInterceptor: Captures HTTP request metrics globally across all routesIntegration
MetricsModuleregistered as global module inAppModuleHttpMetricsInterceptorregistered as global interceptorAuthController(logins),IpnsController(publishes/resolves),IpfsController(uploads/downloads),RepublishProcessor(republish results)Monitoring Infrastructure
alloy-config.river: Added Prometheus scrape job for API/metricsendpoint (30s interval) with remote write to Grafana Cloud Mimircipherbox-staging.json: Complete Grafana dashboard with 20+ panels covering:Configuration
GRAFANA_PROMETHEUS_URL,GRAFANA_PROMETHEUS_USERNAME,GRAFANA_PROMETHEUS_API_KEYMONITORING.mdwith Prometheus setup instructions and metrics referenceDependencies
prom-client@^15.1.3for Prometheus client libraryImplementation Details
/metricsendpoint itself to prevent self-referential loopshttps://claude.ai/code/session_01FunxpyxKdYX4mhfCFuYF8w
Summary by CodeRabbit
New Features
Documentation
Chores
Tests