feat(opamp): compute and store effective_config_hash for collector pipelines #6872
juliaElastic wants to merge 17 commits into elastic:main from
Conversation
…pelines

Computes a SHA-256 hash of the pipeline topology fields (receivers, processors, exporters, connectors, service.pipelines, service.extensions) from the OpAMP effective config and stores it as effective_config_hash in the .fleet-agents index. Non-topology fields (extensions config, service.telemetry, etc.) are excluded so the hash reflects only what the pipeline does, not how it is observed. Keys are canonicalized via yaml.v3 Marshal (which sorts alphabetically) before hashing to ensure identical topologies always produce the same hash regardless of key order.

Closes elastic/ingest-dev#7064

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
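For illustration, here is a minimal Go sketch of the topology-hashing approach this commit describes. The function and variable names (hashTopology, topologyKeys) and the plain map[string]any input are stand-ins for illustration, not the actual fleet-server code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"

	"gopkg.in/yaml.v3"
)

// topologyKeys are the top-level fields that describe what the pipeline does.
var topologyKeys = []string{"receivers", "processors", "exporters", "connectors"}

// hashTopology extracts the topology fields from a parsed config and returns a
// hex-encoded SHA-256 over their canonical YAML encoding. yaml.v3 Marshal sorts
// map keys alphabetically, so key order in the source document never matters.
func hashTopology(cfg map[string]any) (string, error) {
	topology := make(map[string]any)
	for _, k := range topologyKeys {
		if v, ok := cfg[k]; ok {
			topology[k] = v
		}
	}
	// Only service.pipelines and service.extensions count as topology;
	// other service.* fields such as telemetry are deliberately excluded.
	if svc, ok := cfg["service"].(map[string]any); ok {
		kept := make(map[string]any)
		for _, k := range []string{"pipelines", "extensions"} {
			if v, ok := svc[k]; ok {
				kept[k] = v
			}
		}
		if len(kept) > 0 {
			topology["service"] = kept
		}
	}
	canonical, err := yaml.Marshal(topology)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	doc := []byte("receivers:\n  otlp: {}\nexporters:\n  debug: {}\nservice:\n  pipelines:\n    traces: {receivers: [otlp], exporters: [debug]}\n  telemetry:\n    logs: {level: debug}\n")
	var cfg map[string]any
	if err := yaml.Unmarshal(doc, &cfg); err != nil {
		panic(err)
	}
	hash, err := hashTopology(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(hash)
}
```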
…h_test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
…onfig hash

Adds an adjective-noun label (e.g. "swift-hawk") stored alongside effective_config_hash in .fleet-agents. The label is derived from the first two bytes of the SHA-256 hash using two fixed 256-entry wordlists embedded in source, giving 65,536 possible combinations. Because the wordlists are frozen in the codebase, the mapping is stable across deployments and dependency updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
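A rough Go sketch of the label derivation described in this commit; the wordlist entries shown are placeholders (the real lists have 256 fixed entries each) and the function name is illustrative only.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Placeholder wordlists: the real change embeds two frozen 256-entry lists in
// source so the mapping never shifts across deployments or dependency updates.
var adjectives = []string{"swift", "quiet", "bold", "amber"}
var nouns = []string{"hawk", "river", "pine", "otter"}

// configLabel maps the first two bytes of the SHA-256 hash to an
// adjective-noun pair. With full 256-entry lists this yields 256*256 = 65,536
// combinations and the modulo below becomes a no-op.
func configLabel(hash [sha256.Size]byte) string {
	adj := adjectives[int(hash[0])%len(adjectives)]
	noun := nouns[int(hash[1])%len(nouns)]
	return adj + "-" + noun
}

func main() {
	sum := sha256.Sum256([]byte("example effective config"))
	fmt.Printf("%x -> %s\n", sum[:2], configLabel(sum))
}
```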
CI fails due to an environmental issue, not related to changes in this PR: The job, Package x86_64, has been canceled as it failed to get an agent after 5 tries. The job, Package x86_64 FIPS, has been canceled as it failed to get an agent after 5 tries.
#6881 should address the CI issues.
```go
if effectiveConfig.ConfigMap == nil || effectiveConfig.ConfigMap.ConfigMap[""] == nil {
	return "", nil
}
body := effectiveConfig.ConfigMap.ConfigMap[""].Body
```
In theory we should hash everything in the configmap, not just the default config
Added support for multiple config files: 9ac32c8
Though I'm not sure if two separate files should hash differently from the same content in one file.
```go
}

topology := make(map[string]any)
for _, k := range []string{"receivers", "processors", "exporters", "connectors"} {
```
Why do we need to only use these keys when computing the sha?
How do we ensure that a change to another key is emitted to a collector?
Why do we need to only use these keys when computing the sha?
The issue scope is deliberate: the hash is meant to answer "are two collectors running the same pipeline wiring?", not "are all their configs byte-for-byte identical." The topology keys describe what data flows where. Non-topology fields like extensions.* configs or service.telemetry describe how the pipeline operates (scrape intervals, log levels, endpoints) but don't change the pipeline graph.
The use case is: two collectors that differ only in their health_check endpoint or telemetry verbosity should be considered "the same topology" and produce the same hash, so you can cluster/query them together.
How do we ensure that a change to another key is emitted to a collector?
Changes to other keys are not surfaced via the hash, but they are captured in full in effective_config (the redacted JSON blob already stored in .fleet-agents). If a non-topology field changes, effective_config changes, but effective_config_hash stays the same.
Can you add that to the doc string or in schema.json? It's important to note how this hash should differ from the RemoteConfig hash
Yes, this needs to be differentiated from AgentRemoteConfig.config_hash in the opamp spec.
Reading https://github.com/elastic/ingest-dev/issues/7064 it's not clear why we need this new concept of a topology hash, vs just the hash of the configuration.
The problem this topology hash seems to solve is having a lot of collector configurations with superfluous components that do not modify the function of the pipeline. This feels like an optimization that can be done later, once we confirm we have this problem.
Is there a reason we can't start with just the config_hash from the spec that includes everything, and only introduce an optimized topology hash as an optimization on that once that's proven to be necessary?
I think the reason for the topology hash is to group collectors on the UI that run the same pipelines, @andrewvc might have more context.
Here is Claude's take on it:
Why not just hash the full config?
The OTel collector expands environment variables before it reports EffectiveConfig back to
fleet-server. So a collector config with:
```yaml
service:
  telemetry:
    resource:
      service.instance.id: "${env:HOSTNAME}"
```

arrives at fleet-server as:

```yaml
service:
  telemetry:
    resource:
      service.instance.id: "Julias-MacBook-Pro.local"
```

Every collector in the same group — running identical pipelines — produces a unique full-config hash just because of its hostname. Grouping collectors by hash becomes impossible.
That is the concrete reason the topology normalization exists: it strips per-instance runtime
values that vary across collectors in the same group. It is not an optimization for superfluous
components.
When is a full hash sufficient?
If the only requirement is per-collector change detection ("has this collector's config
changed since last check-in?"), a full hash is simpler and still useful. The per-instance
expansion issue does not matter when comparing a single collector against its own history.
If the requirement is topology grouping ("show N collectors running pipeline X"), the
topology hash is necessary.
Removed the topology hashing and using the full effective config to calculate the hash.
@cmacknz with the new opamp metadata around pipeline groups etc. in #6769, I think the hashes will always be unique in the UI, as well as in the case @juliaElastic mentioned. Correct me if I'm wrong, but from what I can tell OpAMP today just uses the bytes of the file, right?
If that's the case, the flow around "Tell me if the collectors in this metadata group are all running the same config" can no longer be achieved in the UI or API. So tracking a config rollout would not be possible.
We can still group collectors by the non_identifying_attributes we tell users to populate can't we?
The problem we have with environment variable substitution in those attributes can happen in any part of the collector config, not just service telemetry.
I think this grouping feature is good, I just struggle to see how we can make it work 100% reliably with hashing in Fleet Server.
We are storing these configurations in a search engine that can compute document similarity in queries based on specific fields, so I don't know why we need to try to do something based on exact equivalence of fields in a less flexible part of the system design.
It's a good point @cmacknz, I agree the hashing mechanism is not perfect. I hadn't thought of using relevance to match similar configs, it's a neat idea! I assume we would implement that a bit later?
Julia did a good analysis in https://github.com/elastic/ingest-dev/issues/7064#issuecomment-4358616898 of why using relevance isn't completely straightforward either.
The conclusion on that thread is asking our PMs about how they view the need for this so we can figure out what the right trade off is. This seems like a completely sensible feature to have, but one where a reliable implementation is much harder than we would like it to be.
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
…entry
HashEffectiveConfig now iterates all named config files in the OpAMP
ConfigMap in sorted key order, feeding each file's topology fields and
its key name into a single SHA-256. Previously only the default ("") file
was considered. Topology extraction is refactored into extractTopologyFields
for per-file reuse.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
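A compilable Go sketch of the per-file flow this commit describes: iterating named config files in sorted key order and feeding each file name plus its topology fields into one running SHA-256. The map[string][]byte input and the extractTopologyFields signature are simplified assumptions, not the real OpAMP types.

```go
package opamp

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"

	"gopkg.in/yaml.v3"
)

// hashConfigFiles hashes every named config file, not just the default ("")
// one, in a deterministic order.
func hashConfigFiles(files map[string][]byte, extractTopologyFields func(map[string]any) map[string]any) (string, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // stable order regardless of map iteration order

	h := sha256.New()
	for _, name := range names {
		var parsed map[string]any
		if err := yaml.Unmarshal(files[name], &parsed); err != nil {
			return "", err
		}
		canonical, err := yaml.Marshal(extractTopologyFields(parsed))
		if err != nil {
			return "", err
		}
		// Feed both the file's key name and its topology fields into a
		// single running hash.
		h.Write([]byte(name))
		h.Write(canonical)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```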
```go
}

topology := make(map[string]any)
for _, k := range []string{"receivers", "processors", "exporters", "connectors"} {
```
Can you add that to the doc string or in schema.json? It's important to note how this hash should differ from the RemoteConfig hash
```json
},
"effective_config_label": {
  "description": "Human-readable adjective-noun label derived from the first two bytes of effective_config_hash",
  "type": "string"
```
This feels like a presentation concern. Should it be generated from the effective_config_hash in the UI code?
Agree this feels out of place in fleet-server and definitely feels like a presentation concern. This could be calculated in the UI.
If we move it to the UI, it either has to be calculated dynamically or the .fleet-agents docs have to be updated from the UI to store it.
If we calculate it dynamically, it should be a runtime field to support filtering on it.
Removed the effective_config_label changes from here and added to kibana: elastic/kibana#265005
Is there a large performance concern with this?
I would like to keep fleet-server in the architectural role of an action router; this is getting away from that much more than other things we've done. I can also imagine other requirements that would require the UI to write this name.
I could imagine that users might ask for the ability to customize the names of their configurations in the future, like they do for agent policies. The friendly label is just an initial placeholder before we can do that.
If using a runtime field, it's a small overhead when querying agents, I would start with that.
If kibana wrote the label to .fleet-agents, there is a chance of write conflicts with fleet-server, and we would need to poll from kibana to notice changes in the effective config hash to update the label.
we would need to poll from kibana to notice changes in the effective config hash to update the label.
Once a label is set, would we want to keep changing it? As I'm understanding it — and let me know if I'm off here — the hash and the label are answering two different questions for a pipeline:
- hash: has the topology of this pipeline changed?
- label: is this conceptually the same pipeline as before?
I mean the label has to stay consistent with the effective config hash, and the effective config can change on check-in if the collector config changes.
We are moving away from the topology hash, see #6872 (comment)
The way I view this is that we are trying to build a concept similar to that of an agent policy out of the per agent effective configurations.
That is a configuration shared by multiple agents doing the same job in the same role with a human readable and assignable name.
The problem is the per agent configurations can all vary per agent because parts of the configuration are determined at runtime (just like Elastic Agent) so doing this is not totally trivial.
Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```go
	initialOpts = append(initialOpts, checkin.WithEffectiveConfigHash(configHash))
}

if defaultParsed, ok := parsedFiles[""]; ok {
```
We only include the effective configuration if it uses a single configuration file? If the collector defines multiple configurations this does nothing?
changed to handle multiple files
```go
if defaultParsed, ok := parsedFiles[""]; ok {
	redactSensitive(defaultParsed)
	effectiveConfigBytes, err := json.Marshal(defaultParsed)
```
HashEffectiveConfig already marshals each individual configuration to JSON and we throw those bytes away; seems like we should reuse them?
```go
if aToS.EffectiveConfig != nil {
	effectiveConfigBytes, err := ParseEffectiveConfig(aToS.EffectiveConfig)
	parsedFiles, err := parseConfigFiles(aToS.EffectiveConfig)
```
It would help if you had unit tests showing that this handles all the AgentConfigMap arrangements we can have: no files, one file, multiple files, non-YAML files, etc.
It's not completely trivial to do this hash so if we don't have a concrete use for it in the UI we could always defer it. I think we have already concluded it's not a great way to group collector configurations together.
added unit tests
I moved the PR to draft
- Skip non-text/yaml config files per OpAMP spec (content_type check)
- Store all config files, not just the single unnamed one; multi-file
and named-file configs are stored as {"name": content} keyed maps
- Return canonical JSON bytes from HashEffectiveConfig so the storage
path reuses them without a second marshal pass; redaction now happens
before hashing so the returned bytes are already redacted
- Add TestUpdateAgentEffectiveConfigMap covering nil config, empty map,
single unnamed file, single named file, multiple files, non-YAML
files, mixed YAML/non-YAML, empty body, and sensitive field redaction
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
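A rough Go sketch of the flow the commit bullets above describe: skip non-text/YAML files, redact before hashing, and return the canonical JSON bytes so the storage path does not marshal a second time. The configFile struct, the content-type check, and the helper signature are assumptions for illustration, not the actual fleet-server types.

```go
package opamp

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"strings"

	"gopkg.in/yaml.v3"
)

// configFile is a stand-in for an OpAMP config map entry (body + content type).
type configFile struct {
	Body        []byte
	ContentType string
}

// hashAndRenderEffectiveConfig returns both the hash and the redacted
// canonical JSON bytes, so callers can store the bytes without marshaling again.
func hashAndRenderEffectiveConfig(files map[string]configFile, redactSensitive func(map[string]any)) (string, []byte, error) {
	stored := make(map[string]json.RawMessage, len(files))
	for name, f := range files {
		// Simplified content-type check: skip files that are neither YAML nor text.
		if f.ContentType != "" && !strings.Contains(f.ContentType, "yaml") && !strings.HasPrefix(f.ContentType, "text/") {
			continue
		}
		var parsed map[string]any
		if err := yaml.Unmarshal(f.Body, &parsed); err != nil {
			return "", nil, err
		}
		redactSensitive(parsed) // redaction happens before hashing
		b, err := json.Marshal(parsed)
		if err != nil {
			return "", nil, err
		}
		stored[name] = b // all files kept, keyed by name ({"name": content})
	}
	// encoding/json sorts map keys, so these bytes are deterministic for
	// hashing and reusable for storage.
	canonical, err := json.Marshal(stored)
	if err != nil {
		return "", nil, err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), canonical, nil
}
```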
Summary
- Computes a SHA-256 hash of the OpAMP effective config and stores it as effective_config_hash in the .fleet-agents index.
- Keys are canonicalized via yaml.v3 Marshal (sorts alphabetically) before hashing, so key order never affects the output.
- The full redacted config is still stored separately (effective_config).

Closes: https://github.com/elastic/ingest-dev/issues/7064
To verify:
- effective_config_hash is stored in .fleet-agents.

🤖 Generated with Claude Code