
Add source-side metric_name allow/deny filter to prometheus_scrape (prevents OOM on high-cardinality endpoints) #25266

@st-omarkhalid

Description


A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

Running prometheus_scrape against a high-cardinality endpoint OOM-kills the agent inside the source itself, before any transform runs.

In our case a single broker exposes a /metrics payload of ~30 MiB / ~265k series. Vector parses all of it into Vec<Event> (each Metric carries an owned BTreeMap<String, TagValueSet>), peaks at ~450 MiB per scrape, and OOMs. Downstream we only ship <1K metric names — but by then the damage is done.

Attempted Solutions

A downstream filter transform doesn't help because the parse has already happened. tag_cardinality_limit, expire_metrics_secs, scrape interval staggering, and scrape_timeout_secs don't address the per-scrape parse peak either. There is currently no way to drop metrics at the scrape layer.

Proposal

Add two optional source-level fields applied to the raw exposition text before parsing:

sources:
  pinot_broker:
    type: prometheus_scrape
    endpoints: [http://broker:8080/metrics]
    metric_name_allowlist:
      - pinot_broker_queries_OneMinuteRate
      - "pinot_broker_queryTotalTimeMs_*"
    metric_name_denylist:
      - "pinot_broker_*_99thPercentile"
  • Shell-style globs (glob is already a workspace dep).
  • Empty allowlist + empty denylist → Cow::Borrowed(body) returned — zero-copy fast path, identical bytes flow to parse_text. Strictly additive; no behavior change for existing users.
  • Active filter walks the body line-by-line, preserves # HELP / # TYPE / unrelated comments, drops data lines whose name fails the predicate.
  • Patch fits in one file (src/sources/prometheus/scrape.rs): two Vec<String> config fields, a small MetricNameFilter struct, a ~40-line helper, and one extra line in on_response between from_utf8_lossy and parse_text (see the sketch below).
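
For concreteness, here is a minimal, self-contained sketch of the idea: a MetricNameFilter built from the two config fields plus a helper that runs on the raw body before parse_text. The helper name filter_body and the exact signatures are placeholders for illustration, not the actual patch; it only assumes the glob crate (Pattern::new / Pattern::matches) that is already a workspace dependency.

use std::borrow::Cow;

use glob::Pattern;

struct MetricNameFilter {
    allow: Vec<Pattern>,
    deny: Vec<Pattern>,
}

impl MetricNameFilter {
    fn new(allow: &[String], deny: &[String]) -> Result<Self, glob::PatternError> {
        let compile = |patterns: &[String]| -> Result<Vec<Pattern>, glob::PatternError> {
            patterns.iter().map(|p| Pattern::new(p)).collect()
        };
        Ok(Self {
            allow: compile(allow)?,
            deny: compile(deny)?,
        })
    }

    // A name is kept if it matches the allowlist (or the allowlist is empty)
    // and matches nothing on the denylist.
    fn keeps(&self, name: &str) -> bool {
        let allowed = self.allow.is_empty() || self.allow.iter().any(|p| p.matches(name));
        let denied = self.deny.iter().any(|p| p.matches(name));
        allowed && !denied
    }

    // Filter the raw exposition text before it is handed to parse_text.
    // With no patterns configured this is a zero-copy pass-through.
    fn filter_body<'a>(&self, body: &'a str) -> Cow<'a, str> {
        if self.allow.is_empty() && self.deny.is_empty() {
            return Cow::Borrowed(body);
        }
        let kept: Vec<&str> = body
            .lines()
            .filter(|line| {
                let trimmed = line.trim_start();
                // Preserve # HELP / # TYPE / other comments and blank lines.
                if trimmed.is_empty() || trimmed.starts_with('#') {
                    return true;
                }
                // The metric name ends at '{' (labels) or the first whitespace (value).
                let end = trimmed
                    .find(|c: char| c == '{' || c.is_whitespace())
                    .unwrap_or(trimmed.len());
                self.keeps(&trimmed[..end])
            })
            .collect();
        Cow::Owned(kept.join("\n"))
    }
}

fn main() -> Result<(), glob::PatternError> {
    let allow = vec!["pinot_broker_queries_OneMinuteRate".to_string()];
    let deny = vec!["pinot_broker_*_99thPercentile".to_string()];
    let filter = MetricNameFilter::new(&allow, &deny)?;

    let body = r#"# HELP pinot_broker_queries_OneMinuteRate one-minute query rate
# TYPE pinot_broker_queries_OneMinuteRate gauge
pinot_broker_queries_OneMinuteRate{table="t1"} 12.5
pinot_broker_queryLatencyMs_99thPercentile{table="t1"} 40.0
"#;

    // Only the comment lines and the allowed series survive.
    println!("{}", filter.filter_body(body));
    Ok(())
}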

In our test cluster this collapses the per-scrape peak from ~450 MiB to <30 MiB and stops the OOM loop on a 1 GiB pod, with no change for sources that don't set the new fields.

I have a working implementation against current master with unit tests and am happy to open a PR if maintainers are open to this direction. It is narrower in scope than #18304's full metric_relabel_configs, which could still be layered on later.

References

Version

Reproduced on 0.43.0 and current master (0.49.0).
