[FLINK-39059][models] Add unified inference metrics for model functions #27724

Open

dubin555 wants to merge 1 commit into apache:master from dubin555:oss-scout/verify-add-model-inference-metrics

Conversation


@dubin555 dubin555 commented Mar 2, 2026

What is the purpose of the change

The flink-models module (both flink-model-triton and flink-model-openai) currently has no metric instrumentation. Users running model inference in production have no visibility into request rates, error rates, or latency — making it impossible to set up monitoring or alerting for inference degradation.

This PR adds unified inference metrics to both Triton and OpenAI model function base classes, following the same MetricGroup patterns used elsewhere in Flink (e.g., CachingAsyncLookupFunction, AsyncMLPredictRunner).

Four metrics are registered under the model_inference group:

| Metric | Type | Description |
| --- | --- | --- |
| inference_requests | Counter | Total inference requests initiated |
| inference_requests_success | Counter | Successful inference completions |
| inference_requests_failure | Counter | Failed requests (network, HTTP, parse errors) |
| inference_latency_ms | Gauge | Last inference round-trip time in ms |

Brief change log

  • Added metric fields and registerMetrics() call in AbstractTritonModelFunction.open() so all Triton subclasses automatically get metrics
  • Instrumented all success/failure paths in TritonInferenceModelFunction.asyncPredict() with counter increments and latency tracking
  • Added metric fields, registration, and whenComplete() instrumentation in AbstractOpenAIModelFunction.open() / asyncPredict()
  • Null inputs and context-overflow-skipped inputs in OpenAI are filtered before incrementing request counts to avoid inflation
  • Used a volatile long gauge for latency rather than a histogram, since DescriptiveStatisticsHistogram lives in flink-runtime, which is not available as a dependency in flink-models
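The instrumentation pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `SimpleCounter` and the `instrument()` helper are invented stand-ins for Flink's `org.apache.flink.metrics.Counter` and `Gauge<Long>`, which the real change registers via `context.getMetricGroup().addGroup("model_inference")` in `open()`.

```java
import java.util.concurrent.CompletableFuture;

// Stand-in for org.apache.flink.metrics.Counter, so the sketch runs
// without a Flink dependency.
class SimpleCounter {
    private long count;
    void inc() { count++; }
    long getCount() { return count; }
}

class ModelInferenceMetrics {
    final SimpleCounter requests = new SimpleCounter();
    final SimpleCounter successes = new SimpleCounter();
    final SimpleCounter failures = new SimpleCounter();
    // Exposed as a Gauge<Long> in the real code; volatile so the metric
    // reporter thread sees the latest value written by the async callback.
    volatile long lastLatencyMs;

    // Hypothetical helper wrapping an async inference call with the
    // counter increments and latency tracking from the change log.
    <T> CompletableFuture<T> instrument(CompletableFuture<T> call) {
        final long start = System.currentTimeMillis();
        requests.inc();
        return call.whenComplete((result, error) -> {
            lastLatencyMs = System.currentTimeMillis() - start;
            if (error == null) {
                successes.inc();
            } else {
                failures.inc();
            }
        });
    }
}
```

Because `whenComplete` fires on both the success and failure paths, every initiated request is accounted for exactly once, which keeps `inference_requests` equal to the sum of success and failure counts.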

Verifying this change

This change added tests and can be verified as follows:

  • Added TritonInferenceMetricsTest — integration test using MockWebServer that verifies metrics are correctly registered and updated after successful inference calls
  • Added OpenAIInferenceMetricsTest — integration test using MockWebServer that verifies chat inference metrics and null-input skip behavior
  • Existing tests pass unchanged (no behavioral changes to inference logic)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

Add inference metrics (request count, success/failure counters, latency
gauge) to both Triton and OpenAI model inference functions. The flink-models
module previously had zero MetricGroup/Counter/Gauge references, making it
impossible to monitor model inference performance in production.

Metrics registered under "model_inference" group:
- inference_requests: total inference requests
- inference_requests_success: successful completions
- inference_requests_failure: failed requests (network, HTTP errors, parse)
- inference_latency_ms: last inference round-trip latency

flinkbot commented Mar 2, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build
