Azure VM deployment with Terraform + Ansible #143

Open: gaganso wants to merge 34 commits into main from terraform

Conversation

gaganso (Collaborator) commented Feb 18, 2026

Summary

Adds fully automated deployment of Kubernetes clusters on Azure VMs using Terraform and Ansible, with a single-command deploy.py script.

Deployment automation (Terraform + Ansible)

  • deploy.py: Single-command orchestrator that provisions Azure VMs, runs Ansible, installs local tools (kubectl, helm, poetry), and configures AIOpsLab -- supports --plan, --apply, and --destroy
  • Mode A (AIOpsLab on controller VM) and Mode B (AIOpsLab on laptop with remote kubectl)
  • Terraform: Azure VMs (controller + N workers), VNet, NSG with configurable source via --allowed-ips / nsg_allowed_source, public IPs
  • Ansible: Docker CE + cri-dockerd, Kubernetes v1.31 (kubeadm), Flannel CNI, kubeconfig fetch with public IP SAN
  • generate_inventory.py: Auto-generates Ansible inventory from Terraform outputs
  • Security: Kubeconfig permissions set to 0600, admin.conf restored to 0600 after fetch, join token/cert hash debug tasks use no_log: true, kubectl binary verified with SHA256 checksum
  • terraform.tfvars.example: Documented config template with sizing examples
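The generate_inventory.py step is essentially a JSON-to-dict transformation over `terraform output -json`. A minimal sketch of that mapping (the function and the exact output field names here are illustrative, not the script's actual API):

```python
# Hypothetical sketch of the Terraform-output -> Ansible-inventory mapping.
# Output names (`controller`, `workers`) and fields (`public_ip`, `private_ip`)
# mirror the PR description, but this is not generate_inventory.py's real code.
def build_inventory(tf_output: dict) -> dict:
    controller = tf_output["controller"]["value"]
    workers = tf_output["workers"]["value"]
    return {
        "all": {
            "children": {
                "controller": {
                    "hosts": {
                        controller["name"]: {
                            "ansible_host": controller["public_ip"],
                            "private_ip": controller["private_ip"],
                        }
                    }
                },
                "workers": {
                    "hosts": {
                        w["name"]: {
                            "ansible_host": w["public_ip"],
                            "private_ip": w["private_ip"],
                        }
                        for w in workers
                    }
                },
            }
        }
    }

# Sample shaped like `terraform output -json` (documentation IPs only).
sample = {
    "controller": {"value": {"name": "ctrl-0", "public_ip": "203.0.113.10",
                             "private_ip": "10.0.1.4"}},
    "workers": {"value": [{"name": "worker-0", "public_ip": "203.0.113.11",
                           "private_ip": "10.0.1.5"}]},
}
inventory = build_inventory(sample)
```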

Documentation changes

  • scripts/terraform/README.md: Full deployment guide with Mode A/B, VM sizing, troubleshooting, cost management
  • CLAUDE.md: Added Azure deployment section with key files, config reference, common issues, architecture diagram
  • README.md and TutorialSetup.md: Added Terraform/Ansible as cluster setup option

Other changes (not deployment-specific)

  • aiopslab/service/kubectl.py: Fix kubeconfig context for non-kind clusters -- read k8s_host from config.yml, only use kind-* context for kind/localhost
  • aiopslab/service/helm.py: Guard -f values_file in upgrade with if values_file: check; add FileNotFoundError with submodule hint for missing charts
  • aiopslab/service/telemetry/prometheus.py: Downgrade "release not found" from ERROR+traceback to WARNING on first run
  • poetry.lock: Refreshed lock file
  • .gitignore: Add terraform state files, ansible retry files
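The kubectl.py context fix above boils down to a small decision rule. A hedged restatement of that rule (illustrative only, not the literal code in aiopslab/service/kubectl.py):

```python
def select_context(k8s_host, cluster_env=None):
    """Pick a kubeconfig context; None means 'use the kubeconfig default'.

    Illustrative restatement of the rule described in the PR summary,
    not the literal code in aiopslab/service/kubectl.py.
    """
    if cluster_env:  # AIOPSLAB_CLUSTER env var: parallel kind clusters
        return f"kind-{cluster_env}"
    if k8s_host in ("kind", "localhost"):
        return "kind-kind"  # local kind cluster
    return None  # remote host from config.yml: use the default context
```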

Test plan

  • deploy.py --plan dry-run succeeds with correct Terraform plan
  • deploy.py --apply --mode B provisions VMs, runs Ansible, configures AIOpsLab
  • kubectl get nodes shows all nodes Ready from laptop
  • python3 cli.py + start misconfig_app_hotel_res-detection-1 runs end-to-end (Prometheus, OpenEBS, app deploy, fault injection, workload)
  • deploy.py --destroy cleans up all Azure resources

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gaganso and others added 30 commits August 24, 2025 00:14
poetry shell was removed in Poetry 2.0. Update README.md and
TutorialSetup.md to use the new activation command and drop the
poetry-plugin-shell dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add nsg_allowed_source variable (default "*") to control SSH and K8s
API NSG rules declaratively. Replaces hardcoded CorpNetPublic so
deploys work without corporate VPN. Users can pass a CIDR or Azure
service tag to restrict access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove add_nsg_corpnet_rule() in favor of Terraform nsg_allowed_source
- Add --allowed-ips flag to pass NSG source through to Terraform
- Add auto-install for kubectl, helm, and poetry (Linux/amd64)
- Add setup_aiopslab_mode_b(): verifies kubeconfig, generates
  aiopslab/config.yml, runs poetry install, prints summary table
- Add setup_aiopslab_mode_a() placeholder
- Remove emoji from log messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document --mode B automatic setup, add tested-on platform note, add
git worktree WSL submodule caveat, update Mode A instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove duplicate Terraform quick start, merge into single Azure
Deployment section with deploy.py single-command workflow. Fix
poetry shell references and add tested-on platform note.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The submodule path check incorrectly treated remote helm repo
references (e.g. chaos-mesh/chaos-mesh) as local paths. Add
remote_chart flag to Chaos Mesh config and fix upgrade method
to respect the same flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	CLAUDE.md
#	poetry.lock
- Replace cluster_size with worker_vm_count in manual tf commands
- Replace poetry shell with eval $(poetry env activate)
- Add poetry env use python3.11 before poetry install
- Remove Documentation section (links to nonexistent files)
- Remove SECURITY.md references
- Replace --restrict-ssh-corpnet with --allowed-ips CorpNetPublic
- Remove deploy_old.py and DEPLOYMENT_GUIDE.md from file tree
- Remove "Migration from v1.0" section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NSG rules default to open (*), not restricted to CorpNetPublic.
Document --allowed-ips flag and nsg_allowed_source variable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change kubeconfig copy mode from 0644 to 0600 on control plane
- Restore /etc/kubernetes/admin.conf to 0600 after fetch
- Set ~/.kube/config to 0600 on localhost after fetch
- Add no_log: true to kube_token and cert_hash debug tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard -f values_file with conditional check in helm upgrade
- Remove unused azurerm_public_ip_prefix resource from main.tf
- Remove unused location variable from variables.tf and tfvars example
- Add gen1/gen2 note to os_sku variable description
- Add trailing newlines to main.tf and variables.tf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
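The values_file guard can be pictured as a small argument builder (hypothetical helper, not helm.py's actual structure):

```python
def helm_upgrade_args(release, chart, values_file=None):
    """Build a `helm upgrade --install` argv; -f is added only when a
    values file is actually given (the guard this commit describes).
    Hypothetical helper, not helm.py's real structure.
    """
    args = ["helm", "upgrade", "--install", release, chart]
    if values_file:
        args += ["-f", values_file]
    return args
```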
Download and verify kubectl binary checksum before installing.
Add note about helm pipe-to-bash being the official install method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
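The checksum verification described here follows the usual pattern of hashing the downloaded bytes and comparing against the published digest. A sketch of that idea (the helper name and sample data are made up):

```python
import hashlib

def sha256_matches(data, expected_hex):
    """Compare downloaded bytes against a published SHA256 checksum.

    Same idea as verifying kubectl against its published .sha256 file;
    a sketch, not deploy.py's actual code.
    """
    return hashlib.sha256(data).hexdigest() == expected_hex.strip().lower()

blob = b"pretend this is the kubectl binary"
published = hashlib.sha256(blob).hexdigest()  # stands in for the fetched checksum
```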
The merge from main introduced AIOPSLAB_CLUSTER which hardcodes a
kind- prefix on the context name, breaking remote/Azure clusters.
Now reads k8s_host from config.yml: uses kind-kind for kind/localhost,
default kubeconfig context for remote hosts, and AIOPSLAB_CLUSTER
env var still works for parallel kind clusters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When Prometheus isn't installed yet, Helm.status() raises RuntimeError.
This is expected on first run -- log a warning instead of a full traceback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
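The downgrade pattern this commit describes — treat an expected first-run failure as a warning, not an error with a traceback — can be sketched like this (`lookup` stands in for Helm.status(); this is not prometheus.py's actual code):

```python
import logging

logger = logging.getLogger("telemetry.sketch")

def release_status(lookup):
    """Return a release's status, treating 'not installed yet' as a warning.

    `lookup` stands in for Helm.status(); illustrative pattern only.
    """
    try:
        return lookup()
    except RuntimeError as exc:  # expected on first run: release not installed yet
        logger.warning("Release not found (expected on first run): %s", exc)
        return None

def missing():
    raise RuntimeError("release: not found")
```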
@gaganso gaganso requested a review from Copilot February 18, 2026 21:48
@gaganso gaganso linked an issue Feb 18, 2026 that may be closed by this pull request
Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces comprehensive automated deployment capabilities for AIOpsLab on Azure using Terraform and Ansible. It transforms the deployment process from manual VM setup to a single-command operation that provisions infrastructure, configures Kubernetes clusters, and sets up the AIOpsLab environment.

Changes:

  • Single-command deployment orchestration via deploy.py with support for plan, apply, and destroy operations
  • Dynamic Kubernetes cluster provisioning (1 controller + N workers) on Azure VMs with configurable sizing
  • Automated Ansible playbook execution for Docker, Kubernetes, and CNI installation
  • Enhanced application code to support remote clusters with proper kubeconfig context handling
  • Comprehensive documentation covering deployment modes, troubleshooting, and security best practices

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 26 comments.

Show a summary per file

File | Description
scripts/terraform/deploy.py | Main orchestrator script (new) - handles Terraform, Ansible, tool installation, and AIOpsLab configuration
scripts/terraform/generate_inventory.py | Generates Ansible inventory from Terraform outputs (new)
scripts/terraform/main.tf | Refactored infrastructure definition with dynamic worker nodes and proper NSG rules
scripts/terraform/variables.tf | Complete rewrite with new variables for VM size, count, SSH keys, and NSG configuration
scripts/terraform/outputs.tf | Restructured outputs with controller/worker details for automation
scripts/terraform/providers.tf | Updated provider versions and added skip_provider_registration guidance
scripts/terraform/terraform.tfvars.example | New configuration template with examples (new)
scripts/terraform/README.md | Comprehensive deployment guide with Mode A/B, troubleshooting, and cost management
scripts/terraform/ssh.tf | Removed (replaced with user-provided SSH keys)
scripts/terraform/data.tf | Moved to main.tf
scripts/ansible/setup_common.yml | Added conntrack package for kubeadm
scripts/ansible/remote_setup_controller_worker.yml | Major enhancements: public IP SAN support, proper permissions, idempotency, verification steps
scripts/ansible/inventory.yml.example | Updated with private_ip fields and improved documentation
aiopslab/service/kubectl.py | Fixed context selection to support non-kind clusters
aiopslab/service/helm.py | Added chart existence checks and values_file guard
aiopslab/service/telemetry/prometheus.py | Downgraded "release not found" from ERROR to WARNING
aiopslab/generators/fault/inject_symp.py | Added remote_chart flag for Chaos Mesh
README.md | Added Azure deployment option with Poetry installation guidance
TutorialSetup.md | Updated to use eval $(poetry env activate)
CLAUDE.md | Documented Azure deployment workflow and troubleshooting
.gitignore | Added terraform state files and sensitive configuration files


Comment on lines +97 to +98
regexp: 'https://{{ control_plane_ip }}:6443'
replace: 'https://{{ control_plane_public_ip }}:6443'

Copilot AI Feb 18, 2026


The regexp replacement uses the private IP but should match any IP pattern. If the control_plane_ip format doesn't match exactly (e.g., has different formatting), the replacement will silently fail. Consider using a more robust pattern like 'regexp: "server: https://[^:]+:6443"' to match any IP or hostname.

Suggested change
- regexp: 'https://{{ control_plane_ip }}:6443'
- replace: 'https://{{ control_plane_public_ip }}:6443'
+ regexp: 'server: https://[^:]+:6443'
+ replace: 'server: https://{{ control_plane_public_ip }}:6443'

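The reviewer's point can be demonstrated directly with Python's re (the kubeconfig line and IP addresses below are made-up samples):

```python
import re

# Made-up kubeconfig line; addresses are documentation IPs.
line = "    server: https://10.0.1.4:6443"

# Anchored on a literal private IP: if the rendered IP differs at all,
# the substitution silently does nothing.
brittle = re.sub(r"https://10\.0\.1\.99:6443", "https://203.0.113.10:6443", line)

# The reviewer's host-agnostic pattern still fires.
robust = re.sub(r"server: https://[^:]+:6443",
                "server: https://203.0.113.10:6443", line)
```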
# Confirm destruction
confirm = input("This will destroy all resources. Type 'yes' to confirm: ")
if confirm.lower() != 'yes':
    logger.info("Destruction cancelled")

Copilot AI Feb 18, 2026


The destroy operation requires manual confirmation (line 675), which is good for safety. However, if the confirmation is not 'yes', the function returns False but logs 'Destruction cancelled' as INFO level. Consider using logger.warning() instead to make it more visible that the operation was intentionally cancelled, distinguishing it from an actual failure.

Suggested change
- logger.info("Destruction cancelled")
+ logger.warning("Destruction cancelled")

value = {
  name       = azurerm_linux_virtual_machine.controller.name
  public_ip  = azurerm_public_ip.controller.ip_address
  private_ip = azurerm_network_interface.controller.ip_configuration[0].private_ip_address

Copilot AI Feb 18, 2026


The private_ip field is referenced in generate_inventory.py (line 88) and remote_setup_controller_worker.yml (line 14), but it's accessed as an output of ip_configuration. This access pattern is correct for Azure, but consider adding a validation check in generate_inventory.py to ensure the private_ip is not empty before writing to the inventory.

Comment on lines +310 to +315
ver = subprocess.run(
[path, "--version"], capture_output=True, text=True, check=True
).stdout.strip()
minor = int(ver.split(".")[1])
if minor >= 11:
return candidate, ver

Copilot AI Feb 18, 2026


The Python version detection logic parses the version string by splitting on '.' and accessing index [1] for the minor version. This will fail if Python returns a version string in an unexpected format or if the split doesn't produce enough elements. Add error handling or validation to prevent IndexError.

Suggested change
- ver = subprocess.run(
-     [path, "--version"], capture_output=True, text=True, check=True
- ).stdout.strip()
- minor = int(ver.split(".")[1])
- if minor >= 11:
-     return candidate, ver
+ proc = subprocess.run(
+     [path, "--version"], capture_output=True, text=True, check=True
+ )
+ # Some Python versions print to stderr instead of stdout
+ ver_output = (proc.stdout or proc.stderr or "").strip()
+ match = re.search(r"Python\s+(\d+)\.(\d+)\.(\d+)", ver_output)
+ if not match:
+     continue
+ major = int(match.group(1))
+ minor = int(match.group(2))
+ # Accept Python 3.11+ and any higher major versions
+ if major > 3 or (major == 3 and minor >= 11):
+     return candidate, ver_output

Comment on lines 28 to 31
if cluster_env:
context = f"kind-{cluster_env}"
elif k8s_host == "kind" or k8s_host == "localhost":
context = "kind-kind"

Copilot AI Feb 18, 2026


The logic for context selection has a potential issue: if k8s_host is set to "localhost", it will use "kind-kind" context (line 30-31). However, for remote clusters accessed via localhost (like with port forwarding), this would incorrectly use the kind context. Consider checking if the context actually exists before using it, or refining the logic to distinguish between kind clusters and remote clusters accessed via localhost.

Suggested change
- if cluster_env:
-     context = f"kind-{cluster_env}"
- elif k8s_host == "kind" or k8s_host == "localhost":
-     context = "kind-kind"
+ # Detect whether the default kind context ("kind-kind") exists in the current kubeconfig.
+ kind_context_exists = False
+ try:
+     contexts, _ = config.list_kube_config_contexts()
+     if contexts:
+         kind_context_exists = any(
+             ctx.get("name") == "kind-kind" for ctx in contexts
+         )
+ except Exception:
+     # If kubeconfig can't be listed, fall back to using the default context.
+     kind_context_exists = False
+ if cluster_env:
+     context = f"kind-{cluster_env}"
+ elif k8s_host == "kind" or k8s_host == "localhost":
+     # Only use the kind context if it actually exists; otherwise, use the default context.
+     context = "kind-kind" if kind_context_exists else None

Comment on lines +169 to +183
try:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    result = sock.connect_ex((host, port))
    sock.close()

    if result == 0:
        logger.info(f"SSH available on {host}")
        return True

except socket.error:
    pass

time.sleep(interval)

Copilot AI Feb 18, 2026


The socket connection check (line 172) uses connect_ex which returns 0 on success. However, the socket is not properly closed in all cases - if an exception occurs in the except block (line 179), the socket remains open. Consider using a 'with' context manager or ensure the socket.close() is in a finally block.

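The reviewer's fix — a `with` block so the socket is closed on every path — might look like this (a sketch under the same host/port/timeout semantics, not deploy.py's actual code):

```python
import socket

def port_open(host, port, timeout=5.0):
    """One connectivity probe; the `with` block closes the socket even if
    settimeout or connect_ex raises. Sketch of the reviewer's suggestion,
    not deploy.py's actual code.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            return sock.connect_ex((host, port)) == 0
    except socket.error:
        return False
```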
Comment on lines +24 to 47
  # SSH access - restrict via var.nsg_allowed_source or --allowed-ips in deploy.py
  security_rule {
    name                       = "SSH"
-   priority                   = 1001
+   priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
-   source_address_prefix      = "*"
+   source_address_prefix      = var.nsg_allowed_source
    destination_address_prefix = "*"
  }
}

resource "azurerm_network_security_group" "aiopslab_nsg_2" {
  name                = "${var.resource_name_prefix}_aiopslabNSG_2"
  location            = var.resource_location
  resource_group_name = var.resource_group_name

  # Kubernetes API server - for remote kubectl access (Mode B)
  security_rule {
-   name                       = "SSH"
-   priority                   = 1001
+   name                       = "KubernetesAPI"
+   priority                   = 110
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
-   destination_port_range     = "22"
-   source_address_prefix      = "*"
+   destination_port_range     = "6443"
+   source_address_prefix      = var.nsg_allowed_source
    destination_address_prefix = "*"

Copilot AI Feb 18, 2026


The azurerm_network_security_group rules for SSH and Kubernetes API use var.nsg_allowed_source, which currently defaults to *, leaving ports 22 and 6443 exposed to the entire internet. An external attacker can scan these endpoints and attempt to exploit SSH or kube-apiserver vulnerabilities or stolen credentials to gain control of the VMs and cluster. Restrict source_address_prefix to specific CIDRs or Azure service tags by default (and require explicit opt-in for *) so that management and API ports are not globally accessible.

Comment on lines +63 to +64
description = "Source address prefix for NSG rules (SSH + K8s API). Use '*' for open access, a CIDR like '203.0.113.0/24', or an Azure service tag like 'CorpNetPublic'."
default = "*"

Copilot AI Feb 18, 2026


The nsg_allowed_source variable defaults to *, which means that unless explicitly overridden, both SSH (22) and Kubernetes API (6443) NSG rules will allow inbound traffic from any IP address. This broad default significantly increases the attack surface by exposing management and control-plane services to internet-wide scanning and exploitation attempts. Use a safer default (such as a corporate CIDR or no default that forces the caller to supply a restricted prefix) and treat * only as an explicit, documented testing override.

Suggested change
- description = "Source address prefix for NSG rules (SSH + K8s API). Use '*' for open access, a CIDR like '203.0.113.0/24', or an Azure service tag like 'CorpNetPublic'."
- default     = "*"
+ description = "Required: source address prefix for NSG rules (SSH + K8s API). Use a restricted CIDR like '203.0.113.0/24' or an Azure service tag like 'CorpNetPublic'. Use '*' only as an explicit, temporary testing override."

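A deploy-time guard could enforce exactly that explicit opt-in. The helper below is hypothetical — deploy.py does not necessarily contain it — and its service-tag check is deliberately crude:

```python
import ipaddress

def check_allowed_source(value, allow_wildcard=False):
    """Validate an NSG source: a CIDR/IP, a service-tag-like name, or '*'
    only when explicitly opted in. Hypothetical guard, not deploy.py code.
    """
    if value == "*":
        if not allow_wildcard:
            raise ValueError(
                "'*' exposes ports 22 and 6443 to the internet; "
                "opt in explicitly if this is really intended"
            )
        return value
    try:
        ipaddress.ip_network(value, strict=False)  # CIDR such as 203.0.113.0/24
        return value
    except ValueError:
        if value.isalnum():  # crude stand-in for an Azure service tag, e.g. CorpNetPublic
            return value
        raise
```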
Comment on lines +160 to +164
# Install Docker, Kubernetes packages on all nodes
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml

# Initialize K8s cluster and join workers
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Copilot AI Feb 18, 2026


The documentation recommends running Ansible with ANSIBLE_HOST_KEY_CHECKING=False, which disables SSH host key verification for cluster provisioning. This allows a network attacker in the path between your machine and Azure to perform a man-in-the-middle attack on the SSH connections and inject or observe all provisioning commands and secrets. Remove ANSIBLE_HOST_KEY_CHECKING=False from the recommended commands and instead manage host keys via known_hosts or Ansible's ssh_known_hosts facilities so host authenticity is still verified.

Suggested change
- # Install Docker, Kubernetes packages on all nodes
- ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
- # Initialize K8s cluster and join workers
- ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
+ # Ensure target host SSH keys are present in ~/.ssh/known_hosts (e.g., via ssh-keyscan or Ansible's ssh_known_hosts module).
+ # Install Docker, Kubernetes packages on all nodes
+ ansible-playbook -i inventory.yml setup_common.yml
+ # Initialize K8s cluster and join workers
+ ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Comment on lines +118 to +119
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Copilot AI Feb 18, 2026


The Azure deployment instructions suggest running Ansible with ANSIBLE_HOST_KEY_CHECKING=False, which turns off SSH host key verification for provisioning the Kubernetes cluster. With host key checks disabled, a network-positioned attacker can impersonate the VMs during Ansible runs and gain control over the cluster or capture credentials without being detected. Drop ANSIBLE_HOST_KEY_CHECKING=False from the example commands and rely on proper SSH host key management so users maintain protection against man-in-the-middle attacks.

Suggested change
- ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
- ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
+ ansible-playbook -i inventory.yml setup_common.yml
+ ansible-playbook -i inventory.yml remote_setup_controller_worker.yml

Gagan and others added 3 commits February 23, 2026 16:25
Mode A runs AIOpsLab on a kubeadm controller VM where k8s_host is
"localhost" but the kubeconfig context is kubernetes-admin@kubernetes,
not kind-kind. Only treat k8s_host="kind" as a kind cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup_aiopslab.yml installs Python 3.11 (deadsnakes PPA), Poetry, Helm,
delivers code via git clone or rsync (dev mode), adds user to docker
group for VirtualizationFaultInjector, generates config.yml from Jinja2
template, and runs poetry install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements setup_aiopslab_mode_a() which runs the Ansible playbook with
extra-vars for clone vs rsync mode. Adds setup_only() for re-running
setup without reprovisioning VMs. New CLI flags: --setup-only (mutually
exclusive with --plan/--apply/--destroy), --dev (rsync local repo
instead of git clone, Mode A only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
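The mutual-exclusion described in this commit maps naturally onto argparse. One possible flag layout (the real deploy.py may wire these differently — for example, it may allow some modes to combine):

```python
import argparse

def build_parser():
    """One possible flag layout matching the commit message; the real
    deploy.py may wire these differently."""
    p = argparse.ArgumentParser(prog="deploy.py")
    group = p.add_mutually_exclusive_group()
    group.add_argument("--plan", action="store_true")
    group.add_argument("--apply", action="store_true")
    group.add_argument("--destroy", action="store_true")
    group.add_argument("--setup-only", dest="setup_only", action="store_true")
    p.add_argument("--dev", action="store_true",
                   help="rsync the local repo instead of git clone (Mode A only)")
    return p
```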

Development

Successfully merging this pull request may close these issues.

Automate provisioning and deployment