Fix for extension dropped config issue & revert canary release (PR #1718)
Merged
This reverts commit 7ea95a2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ksum restart

When the extension manager stops re-delivering protectedParameters (e.g. during an extension auto-update), the rendered .Values workspace key becomes empty, and the chart would overwrite the live ama-logs-secret KEY with an empty value. This breaks the agent ~10-14 days later, when its cached credentials expire.

Changes:
- ama-logs-secret.yaml: when the incoming workspace key is empty or a placeholder, fall back to the extension manager's on-cluster protected-parameters secret 'protected-ext-parameters-<release>' (data key 'OmsAgent.workspaceKey' for AKS, 'amalogs.secret.key' for Arc). The config agent persists this secret from protectedParameters and does not clear it when it drops the CR reference, so it remains the source of truth. Only the KEY is recovered; the WSID is non-protected and is still delivered. The lookup is a no-op on first install (the incoming key is populated).
- ama-logs-daemonset.yaml / ama-logs-daemonset-windows.yaml / ama-logs-deployment.yaml: add a checksum/secret annotation to the AKS (non-Arc) Linux and Windows pod templates so pods roll when the effective secret changes. The previous WSID-only annotation did not detect workspace KEY changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The protected-parameters key fallback only applies to workspace-key (non-AAD) clusters. On AAD/managed-identity clusters the workspace key is empty by design (auth uses a token, not a key), so recovering a key there is meaningless. Evaluate isUsingAADAuth (AKS) / useAADAuth (Arc) the same way the daemonset/deployment templates do, and skip the lookup and fallback entirely when AAD auth is in use.

Validated on a live AKS cluster: with a key injected into protected-ext-parameters-*, isUsingAADAuth=true does NOT recover it (gated), and isUsingAADAuth=false does.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gate the protected-parameters lookup on BOTH non-AAD auth AND the incoming workspace key being empty/placeholder, so recovery runs only when the mandatory key is missing (i.e. the cluster is already broken or about to be). Healthy clusters (key supplied) and AAD clusters skip the live secret read entirely, so the common path makes no unnecessary get-secret API call. An explicit if/else (not a ternary) picks the active path's incoming key: OmsAgent.workspaceKey for AKS, amalogs.secret.key for Arc.

Validated on a live AKS cluster: non-AAD + empty key recovers; non-AAD + real key uses the supplied key (lookup skipped); AAD + empty key stays empty (gated). helm lint is clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
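A minimal Go sketch of the two-condition gate, assuming a caller-supplied function stands in for Helm's lookup of the protected-parameters secret so the number of live reads can be counted. The Inputs field names, the placeholder value, and the recovered key are hypothetical; only the gating logic and the AKS/Arc value names come from the commit message. The three calls in main mirror the three validation cases reported above.

```go
package main

import (
	"fmt"
	"strings"
)

// placeholderKey is HYPOTHETICAL; the PR does not name the placeholder string.
const placeholderKey = "<your_workspace_key>"

// Inputs holds the values the gate consults. Field names are hypothetical;
// the chart reads isUsingAADAuth/OmsAgent.workspaceKey on AKS and
// useAADAuth/amalogs.secret.key on Arc.
type Inputs struct {
	IsArc     bool
	AKSUseAAD bool
	ArcUseAAD bool
	AKSKey    string
	ArcKey    string
}

// lookupFunc stands in for reading protected-ext-parameters-<release>;
// ok is false when no key is stored there.
type lookupFunc func() (key string, ok bool)

// missing reports whether a key is empty or the placeholder.
func missing(key string) bool {
	k := strings.TrimSpace(key)
	return k == "" || k == placeholderKey
}

// resolve returns the effective workspace key, reading the live secret only
// when AAD auth is off AND the incoming key is missing.
func resolve(in Inputs, lookup lookupFunc) string {
	var useAAD bool
	var incoming string
	// Explicit if/else (not a ternary) selects the active path's values.
	if in.IsArc {
		useAAD, incoming = in.ArcUseAAD, in.ArcKey
	} else {
		useAAD, incoming = in.AKSUseAAD, in.AKSKey
	}
	if !useAAD && missing(incoming) {
		if key, ok := lookup(); ok {
			return key
		}
	}
	return incoming
}

func main() {
	calls := 0
	lookup := func() (string, bool) { calls++; return "recovered-key", true }

	fmt.Println(resolve(Inputs{}, lookup))                       // non-AAD + empty: recovered
	fmt.Println(resolve(Inputs{AKSKey: "real-key"}, lookup))     // non-AAD + key: lookup skipped
	fmt.Printf("%q\n", resolve(Inputs{AKSUseAAD: true}, lookup)) // AAD + empty: stays empty
	fmt.Println("lookups:", calls)
}
```

Running it prints the recovered key, the supplied key, an empty quoted string, and "lookups: 1": only the broken non-AAD case pays for the secret read.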
Author (Contributor): /azp run
Azure Pipelines successfully started running 1 pipeline(s).
Author (Contributor): /azp run
Azure Pipelines successfully started running 1 pipeline(s).
zanejohnson-azure approved these changes on Jun 19, 2026
rashmichandrashekar approved these changes on Jun 19, 2026
zanejohnson-azure added a commit that referenced this pull request on Jun 23, 2026:
Restores the canary stages reverted in #1718 (Stage_Canary_MCR, Stage_Canary_Regions, Wait_After_Canary) so the Arc prod pipeline runs canary -> 25h bake -> manual-gated prod tiers in one pipeline. Includes the fix that the original #1715 was missing: Stage_1 is trigger: manual WITHOUT an explicit dependsOn (ADO rejects dependsOn on manually-triggered stages). Stage_1 still orders after Wait_After_Canary via default sequential dependency.
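The ordering claim in this commit relies on the Azure Pipelines default that a stage without a dependsOn depends on the stage listed immediately before it. The Go sketch below models only that rule for the restored stages. The Stage type is hypothetical, the relative order of the canary stages is assumed, and treating dependsOn on a manual stage as an error simply encodes what this commit reports rather than a rule taken from ADO documentation. An explicitly empty DependsOn is modelled as "no dependencies", matching how dependsOn: [] behaves in Azure Pipelines.

```go
package main

import (
	"fmt"
	"os"
)

// Stage is a HYPOTHETICAL, simplified model of an Azure Pipelines stage.
type Stage struct {
	Name      string
	DependsOn []string // nil = not specified; empty slice = explicitly none
	Manual    bool     // models "trigger: manual"
}

// effectiveDeps applies the default rule: a stage that omits dependsOn
// depends on the stage listed immediately before it.
func effectiveDeps(stages []Stage) (map[string][]string, error) {
	deps := make(map[string][]string, len(stages))
	for i, s := range stages {
		if s.Manual && s.DependsOn != nil {
			// Encodes the rejection reported in this commit message.
			return nil, fmt.Errorf("stage %s: dependsOn combined with trigger: manual", s.Name)
		}
		switch {
		case s.DependsOn != nil:
			deps[s.Name] = s.DependsOn
		case i > 0:
			deps[s.Name] = []string{stages[i-1].Name}
		default:
			deps[s.Name] = nil // first stage in this excerpt
		}
	}
	return deps, nil
}

func main() {
	// Stage names from the commit; their relative order is assumed.
	stages := []Stage{
		{Name: "Stage_Canary_MCR"},
		{Name: "Stage_Canary_Regions"},
		{Name: "Wait_After_Canary"},
		{Name: "Stage_1", Manual: true},
	}
	deps, err := effectiveDeps(stages)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(deps["Stage_1"]) // [Wait_After_Canary]
}
```

The point of the model is that Stage_1 still inherits its predecessor as a dependency, so the 25h bake in Wait_After_Canary gates it even though Stage_1 declares no dependsOn of its own.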
This pull request introduces improvements to the Azure Monitor Container Insights Helm chart, focusing on enhancing secret management for workspace keys and adding a configuration fix flag. The main changes ensure that missing workspace keys are recovered from an existing protected secret, preventing agent outages, and add a checksum annotation to trigger pod restarts when secrets change.
Secret management and recovery improvements:
- Updated ama-logs-secret.yaml to recover the workspace key from the protected-ext-parameters-<release> secret if the key is missing and the cluster is not using AAD authentication, preventing agent downtime during extension manager updates. The effective workspace key ($arcKey/$aksKey) is now resolved using this recovery logic.
- Updated ama-logs-secret.yaml to use the resolved workspace key variables ($arcKey and $aksKey) instead of the direct values from .Values, ensuring the recovered key is used when necessary. [1] [2]

Deployment and pod restart improvements:
- Added a checksum/secret annotation to the pod templates in ama-logs-daemonset.yaml, ama-logs-daemonset-windows.yaml, and ama-logs-deployment.yaml so that pods restart automatically when the secret changes. [1] [2] [3]

Configuration fixes:
- Added the EXTENSION_DROPPED_CONFIG_FIX_ENABLED environment variable (set to "true") to all relevant deployments and daemonsets, enabling a fix for extension-dropped configuration issues. [1] [2] [3]
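Why the old WSID-only annotation missed key changes can be shown with a short Go model of the annotation values. It assumes the common Helm idiom of hashing the rendered secret (an include of the secret template piped to sha256sum), which the summary does not spell out; renderSecret is a hypothetical stand-in for that rendered manifest. Because the annotation lives in the pod template, any change to its value changes the template and triggers a rolling update.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// renderSecret is a HYPOTHETICAL stand-in for the rendered ama-logs-secret
// manifest; only the two fields that matter here are included.
func renderSecret(wsid, key string) string {
	return fmt.Sprintf("WSID: %s\nKEY: %s\n", wsid, key)
}

// checksum returns the lowercase hex SHA-256 digest, like Helm's sha256sum.
func checksum(s string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(s)))
}

// wsidOnlyAnnotation models the previous annotation: it ignores the key.
func wsidOnlyAnnotation(wsid, key string) string { return checksum(wsid) }

// secretAnnotation models checksum/secret: a hash of the whole rendered secret.
func secretAnnotation(wsid, key string) string { return checksum(renderSecret(wsid, key)) }

func main() {
	// Same workspace ID, rotated (or recovered) key.
	fmt.Println("WSID-only annotation changed:",
		wsidOnlyAnnotation("wsid-1", "old-key") != wsidOnlyAnnotation("wsid-1", "new-key"))
	fmt.Println("checksum/secret annotation changed:",
		secretAnnotation("wsid-1", "old-key") != secretAnnotation("wsid-1", "new-key"))
}
```

This prints false for the WSID-only annotation and true for checksum/secret: hashing the whole effective secret is what lets a recovered or rotated KEY reach running pods.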