
feat(opamp): compute and store effective_config_hash for collector pipelines#6872

Draft
juliaElastic wants to merge 17 commits into elastic:main from juliaElastic:hash-effective-config

Conversation

Contributor

@juliaElastic juliaElastic commented Apr 21, 2026

Summary

  • Computes a SHA-256 hash of the OTel collector's effective config and stores it as effective_config_hash in the .fleet-agents index.
  • Keys are canonicalized via yaml.v3 Marshal (which sorts map keys alphabetically) before hashing, so key order never affects the output.
  • Hash is computed from the raw YAML body (before the sensitive-value redaction pass that produces effective_config).

Closes: https://github.com/elastic/ingest-dev/issues/7064
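The canonicalize-then-hash step can be sketched in Go. The PR canonicalizes with yaml.v3; this sketch approximates that with encoding/json, whose Marshal also writes map keys in sorted order, and the function name is illustrative rather than the PR's actual API:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// hashConfig canonicalizes a parsed config map and returns the SHA-256 hex
// digest. json.Marshal emits map keys in sorted order, so the key order of
// the source document never affects the resulting hash.
func hashConfig(cfg map[string]any) (string, error) {
	canonical, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := map[string]any{"receivers": "otlp", "exporters": "debug"}
	b := map[string]any{"exporters": "debug", "receivers": "otlp"}
	ha, _ := hashConfig(a)
	hb, _ := hashConfig(b)
	fmt.Println(ha == hb) // same content, different key order → true
}
```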

To verify:

GET .fleet-agents/_search
{
    "_source": ["effective_config_hash", "effective_config"]
}

    "hits": [
      {
        "_index": ".fleet-agents-7",
        "_id": "95d87b10-299e-45ce-a5dd-72e500451e48",
        "_score": 1,
        "_source": {
          "effective_config": {
           ...
          },
          "effective_config_hash": "4e13e7566f2c31e563e235dd8e1a7ae8998a6677fae6124806ca842904f6aa0e"
        }
      },

🤖 Generated with Claude Code

…pelines

Computes a SHA-256 hash of the pipeline topology fields (receivers,
processors, exporters, connectors, service.pipelines, service.extensions)
from the OpAMP effective config and stores it as effective_config_hash in
the .fleet-agents index. Non-topology fields (extensions config,
service.telemetry, etc.) are excluded so the hash reflects only what the
pipeline does, not how it is observed. Keys are canonicalized via yaml.v3
Marshal (which sorts alphabetically) before hashing to ensure identical
topologies always produce the same hash regardless of key order.

Closes elastic/ingest-dev#7064

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
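A minimal sketch of the topology extraction this commit message describes. The field names come from the message itself; the helper names and the json-based canonicalization are assumptions of this sketch (the PR uses yaml.v3):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// extractTopology keeps only the fields that describe the pipeline graph.
// Non-topology fields (extension settings, service.telemetry, ...) are
// dropped so they cannot influence the hash.
func extractTopology(cfg map[string]any) map[string]any {
	topology := make(map[string]any)
	for _, k := range []string{"receivers", "processors", "exporters", "connectors"} {
		if v, ok := cfg[k]; ok {
			topology[k] = v
		}
	}
	if svc, ok := cfg["service"].(map[string]any); ok {
		svcTop := make(map[string]any)
		for _, k := range []string{"pipelines", "extensions"} {
			if v, ok := svc[k]; ok {
				svcTop[k] = v
			}
		}
		if len(svcTop) > 0 {
			topology["service"] = svcTop
		}
	}
	return topology
}

// topologyHash canonicalizes the extracted topology and hashes it.
func topologyHash(cfg map[string]any) string {
	canonical, _ := json.Marshal(extractTopology(cfg))
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:])
}

func main() {
	cfg := map[string]any{
		"receivers": map[string]any{"otlp": map[string]any{}},
		"exporters": map[string]any{"debug": map[string]any{}},
		"service": map[string]any{
			"pipelines": map[string]any{"logs": map[string]any{}},
			"telemetry": map[string]any{"level": "debug"}, // dropped before hashing
		},
	}
	fmt.Println(topologyHash(cfg))
}
```

Two configs that differ only in service.telemetry (or any other non-topology field) therefore hash identically.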
@juliaElastic juliaElastic added the enhancement (New feature or request), backport-skip (Skip notification from the automated backport with mergify), and skip-changelog labels Apr 21, 2026
…h_test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

mergify Bot commented Apr 21, 2026

This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch, where /d is a digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@juliaElastic juliaElastic marked this pull request as ready for review April 22, 2026 07:17
@juliaElastic juliaElastic requested a review from a team as a code owner April 22, 2026 07:17
juliaElastic and others added 2 commits April 22, 2026 10:51
…onfig hash

Adds an adjective-noun label (e.g. "swift-hawk") stored alongside
effective_config_hash in .fleet-agents. The label is derived from the
first two bytes of the SHA-256 hash using two fixed 256-entry wordlists
embedded in source, giving 65,536 possible combinations. Because the
wordlists are frozen in the codebase the mapping is stable across
deployments and dependency updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

CI fails due to an environmental issue, not related to changes in this PR:

The job, Package x86_64, has been canceled as it failed to get an agent after 5 tries.

The job, Package x86_64 FIPS, has been canceled as it failed to get an agent after 5 tries.

Member

ebeahan commented Apr 23, 2026

CI fails due to an environmental issue, not related to changes in this PR

#6881 should address the CI issues.

Comment thread internal/pkg/api/configHash.go Outdated
if effectiveConfig.ConfigMap == nil || effectiveConfig.ConfigMap.ConfigMap[""] == nil {
return "", nil
}
body := effectiveConfig.ConfigMap.ConfigMap[""].Body
Contributor

In theory we should hash everything in the configmap, not just the default config

Contributor Author

Added support for multiple config files: 9ac32c8

Though I'm not sure if two separate files should hash differently from the same content in one file.

Comment thread internal/pkg/api/configHash.go Outdated
}

topology := make(map[string]any)
for _, k := range []string{"receivers", "processors", "exporters", "connectors"} {
Contributor

Why do we need to only use these keys when computing the sha?
How do we ensure that a change to another key is emitted to a collector?

Contributor Author

Why do we need to only use these keys when computing the sha?

The issue scope is deliberate: the hash is meant to answer "are two collectors running the same pipeline wiring?", not "are all their configs byte-for-byte identical." The topology keys describe what data flows where. Non-topology fields like extensions.* configs or service.telemetry describe how the pipeline operates (scrape intervals, log levels, endpoints) but don't change the pipeline graph.

The use case is: two collectors that differ only in their health_check endpoint or telemetry verbosity should be considered "the same topology" and produce the same hash, so you can cluster/query them together.

How do we ensure that a change to another key is emitted to a collector?

Changes to other keys are not surfaced via the hash, but they are captured in full in effective_config (the redacted JSON blob already stored in .fleet-agents). If a non-topology field changes, effective_config changes, but effective_config_hash stays the same.

Contributor

Can you add that to the doc string or in schema.json? It's important to note how this hash should differ from the RemoteConfig hash

Member

Yes, this needs to be differentiated from AgentRemoteConfig.config_hash in the opamp spec.

Reading https://github.com/elastic/ingest-dev/issues/7064 it's not clear why we need this new concept of a topology hash, vs just the hash of the configuration.

The problem this topology hash seems to solve is the case where you have a lot of collector configurations with superfluous components that do not modify the function of the pipeline. This feels like an optimization that can be done later, once we confirm we have this problem.

Is there a reason we can't start with just the config_hash from the spec that includes everything, and only introduce an optimized topology hash as an optimization on that once that's proven to be necessary?

Contributor Author

I think the reason for the topology hash is to group collectors on the UI that run the same pipelines, @andrewvc might have more context.

Here is Claude's take on it:

Why not just hash the full config?

The OTel collector expands environment variables before it reports EffectiveConfig back to
fleet-server. So a collector config with:

service:
  telemetry:
    resource:
      service.instance.id: "${env:HOSTNAME}"

arrives at fleet-server as:

service:
  telemetry:
    resource:
      service.instance.id: "Julias-MacBook-Pro.local"

Every collector in the same group — running identical pipelines — produces a unique full-config
hash just because of its hostname. Grouping collectors by hash becomes impossible.

That is the concrete reason the topology normalization exists: it strips per-instance runtime
values that vary across collectors in the same group. It is not an optimization for superfluous
components.

When is a full hash sufficient?

If the only requirement is per-collector change detection ("has this collector's config
changed since last check-in?"), a full hash is simpler and still useful. The per-instance
expansion issue does not matter when comparing a single collector against its own history.

If the requirement is topology grouping ("show N collectors running pipeline X"), the
topology hash is necessary.

Contributor Author

Removed the topology hashing and using the full effective config to calculate the hash.


@cmacknz with the new opamp metadata around pipeline groups etc. in #6769, I think the hashes will always be unique in the UI, as well as in the case @juliaElastic mentioned. Correct me if I'm wrong, but from what I can tell OpAMP today just uses the bytes of the file, right?

If that's the case the flow around "Tell me if the collectors in this metadata group are all running the same config" no longer can be achieved in the UI or API. So, tracking a config rollout would not be possible.

Member

We can still group collectors by the non_identifying_attributes we tell users to populate, can't we?

The problem we have with environment variable substitution in those attributes can happen in any part of the collector config, not just service telemetry.

I think this grouping feature is good, I just struggle to see how we can make it work 100% reliably with hashing in Fleet Server.

We are storing these configurations in a search engine that can compute document similarity in queries based on specific fields, so I don't know why we need to try to do something based on exact equivalence of fields in a less flexible part of the system design.


It's a good point @cmacknz, I agree the hashing mechanism is not perfect. I hadn't thought of using relevance to match similar configs; it's a neat idea! I assume we would implement that a bit later?

Member
The reason will be displayed to describe this comment to others. Learn more.

Julia did a good analysis in https://github.com/elastic/ingest-dev/issues/7064#issuecomment-4358616898 of why using relevance isn't completely straightforward either.

The conclusion on that thread is asking our PMs about how they view the need for this so we can figure out what the right trade off is. This seems like a completely sensible feature to have, but one where a reliable implementation is much harder than we would like it to be.

Comment thread internal/pkg/api/configLabel_test.go Outdated
Comment thread internal/pkg/api/configLabel_test.go Outdated
Comment thread internal/pkg/api/handleOpAMP.go Outdated
juliaElastic and others added 3 commits April 24, 2026 09:01
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
…entry

HashEffectiveConfig now iterates all named config files in the OpAMP
ConfigMap in sorted key order, feeding each file's topology fields and
its key name into a single SHA-256. Previously only the default ("") file
was considered. Topology extraction is refactored into extractTopologyFields
for per-file reuse.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
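The sorted multi-file iteration described in this commit message can be sketched as follows. This simplified version hashes each file's full parsed content (the real HashEffectiveConfig extracts topology fields per file), and encoding/json stands in for the yaml.v3 canonicalization:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// hashConfigFiles feeds every named config file into a single SHA-256 in
// sorted key order. The file name is mixed into the hash, so the same
// content under a different file name hashes differently.
func hashConfigFiles(files map[string]map[string]any) (string, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order regardless of map iteration

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		canonical, err := json.Marshal(files[name])
		if err != nil {
			return "", err
		}
		h.Write(canonical)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	files := map[string]map[string]any{
		"":           {"receivers": "otlp"}, // the default, unnamed config
		"extra.yaml": {"exporters": "debug"},
	}
	sum, _ := hashConfigFiles(files)
	fmt.Println(sum)
}
```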
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Apr 24, 2026
Comment thread internal/pkg/api/configHash.go Outdated
Comment thread internal/pkg/api/configHash.go Outdated
Comment thread internal/pkg/api/configHash.go Outdated
Comment thread model/schema.json
},
"effective_config_label": {
"description": "Human-readable adjective-noun label derived from the first two bytes of effective_config_hash",
"type": "string"
Contributor

This feels like a presentation concern. Should it be generated from the effective_config_hash in the UI code?

Member

Agree this feels out of place in fleet-server and definitely feels like a presentation concern. This could be calculated in the UI.

Contributor Author

@juliaElastic juliaElastic Apr 28, 2026

If we move it to the UI, it either has to be calculated dynamically or the .fleet-agents docs has to be updated from the UI to store it.

If we calculate it dynamically, it should be a runtime field to support filtering on it.

Removed the effective_config_label changes from here and added to kibana: elastic/kibana#265005

Member

Is there a large performance concern with this?

I would like to keep fleet-server in the architectural role of an action router; this is getting away from that much more than other things we've done. I can also imagine other requirements that would require the UI to write this name.

I could imagine that users might ask for the ability to customize the names of their configurations in the future, like they do for agent policies. The friendly label is just an initial placeholder before we can do that.

Contributor Author

If using a runtime field, it's a small overhead when querying agents, I would start with that.

If kibana wrote the label to .fleet-agents, there is a chance of write conflicts with fleet-server, and we would need to poll from kibana to notice changes in the effective config hash to update the label.
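For illustration, a runtime field computed at query time might look like the sketch below, in the same console style as the verification query above. The effective_config_hash field name matches this PR, but the script, the %4 modulo, and the four-entry word lists are placeholders (real lists would have 256 entries indexed directly by the first two hash bytes):

```
GET .fleet-agents/_search
{
  "runtime_mappings": {
    "effective_config_label": {
      "type": "keyword",
      "script": {
        "source": "if (doc['effective_config_hash'].size() == 0) return; String h = doc['effective_config_hash'].value; emit(params.adjectives[Integer.parseInt(h.substring(0, 2), 16) % 4] + '-' + params.nouns[Integer.parseInt(h.substring(2, 4), 16) % 4]);",
        "params": {
          "adjectives": ["swift", "calm", "brave", "quiet"],
          "nouns": ["hawk", "otter", "pine", "wolf"]
        }
      }
    }
  },
  "query": { "term": { "effective_config_label": "swift-hawk" } }
}
```

Because the field is defined at query time, it supports filtering without storing the label or risking write conflicts between kibana and fleet-server.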

Contributor

we would need to poll from kibana to notice changes in the effective config hash to update the label.

Once a label is set, would we want to keep changing it? As I'm understanding it — and let me know if I'm off here — the hash and the label are answering two different questions for a pipeline:

  • hash: has the topology of this pipeline changed?
  • label: is this conceptually the same pipeline as before?

Contributor Author

I mean the label has to stay consistent with the effective config hash, and the effective config can change on check-in if the collector config changes.

We are moving away from the topology hash, see #6872 (comment)

Member

The way I view this is that we are trying to build a concept similar to that of an agent policy out of the per agent effective configurations.

That is a configuration shared by multiple agents doing the same job in the same role with a human readable and assignable name.

The problem is the per agent configurations can all vary per agent because parts of the configuration are determined at runtime (just like Elastic Agent) so doing this is not totally trivial.

Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Comment thread internal/pkg/api/configHash.go Outdated
juliaElastic and others added 3 commits April 28, 2026 11:40
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread internal/pkg/api/configHash.go Outdated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@juliaElastic juliaElastic requested a review from cmacknz April 29, 2026 07:36
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread internal/pkg/api/configHash.go
Comment thread internal/pkg/api/handleOpAMP.go Outdated
initialOpts = append(initialOpts, checkin.WithEffectiveConfigHash(configHash))
}

if defaultParsed, ok := parsedFiles[""]; ok {
Member

We only include the effective configuration if it uses a single configuration file? If the collector defines multiple configurations this does nothing?

Contributor Author

changed to handle multiple files

Comment thread internal/pkg/api/handleOpAMP.go Outdated

if defaultParsed, ok := parsedFiles[""]; ok {
redactSensitive(defaultParsed)
effectiveConfigBytes, err := json.Marshal(defaultParsed)
Member

HashEffectiveConfig already marshals each individual configuration to JSON and we throw those bytes away; seems like we should reuse them?

Contributor Author

addressed


if aToS.EffectiveConfig != nil {
effectiveConfigBytes, err := ParseEffectiveConfig(aToS.EffectiveConfig)
parsedFiles, err := parseConfigFiles(aToS.EffectiveConfig)
Member

@cmacknz cmacknz Apr 29, 2026

It would help if you had unit tests showing that this handles all the AgentConfigMap arrangements we can have: no files, one file, multiple files, non-YAML files, etc.

https://github.com/open-telemetry/opamp-spec/blob/2da595f59a0016abe67b4d44aa52afa3549f8742/proto/opamp.proto#L1032-L1039

It's not completely trivial to do this hash so if we don't have a concrete use for it in the UI we could always defer it. I think we have already concluded it's not a great way to group collector configurations together.

Contributor Author

added unit tests

Contributor Author

I moved the PR to draft

- Skip non-text/yaml config files per OpAMP spec (content_type check)
- Store all config files, not just the single unnamed one; multi-file
  and named-file configs are stored as {"name": content} keyed maps
- Return canonical JSON bytes from HashEffectiveConfig so the storage
  path reuses them without a second marshal pass; redaction now happens
  before hashing so the returned bytes are already redacted
- Add TestUpdateAgentEffectiveConfigMap covering nil config, empty map,
  single unnamed file, single named file, multiple files, non-YAML
  files, mixed YAML/non-YAML, empty body, and sensitive field redaction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
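The first two bullets above can be sketched like this. The struct, the function names, and the exact set of accepted content types are assumptions of this sketch, not the PR's actual code (the OpAMP AgentConfigFile message carries a raw body plus a content_type string):

```go
package main

import "fmt"

// configFile is a minimal stand-in for an OpAMP config-map entry.
type configFile struct {
	Body        []byte
	ContentType string
}

// isTextOrYAML reports whether a config file should be parsed and stored.
// An empty content_type is accepted as the common default; the accepted
// MIME strings here are illustrative.
func isTextOrYAML(ct string) bool {
	switch ct {
	case "", "text/yaml", "application/x-yaml", "text/plain":
		return true
	}
	return false
}

// storableFiles builds the {"name": content} keyed map described above,
// skipping empty bodies and files that are neither text nor YAML.
func storableFiles(files map[string]configFile) map[string]string {
	out := make(map[string]string)
	for name, f := range files {
		if isTextOrYAML(f.ContentType) && len(f.Body) > 0 {
			out[name] = string(f.Body)
		}
	}
	return out
}

func main() {
	files := map[string]configFile{
		"":         {Body: []byte("receivers: {}"), ContentType: "text/yaml"},
		"cert.der": {Body: []byte{0x30, 0x82}, ContentType: "application/octet-stream"},
	}
	fmt.Println(storableFiles(files)) // only the YAML file survives
}
```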
@juliaElastic juliaElastic marked this pull request as draft April 30, 2026 07:55
7 participants