Conversation
poetry shell was removed in Poetry 2.0. Update README.md and TutorialSetup.md to use the new activation command and drop the poetry-plugin-shell dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add nsg_allowed_source variable (default "*") to control SSH and K8s API NSG rules declaratively. Replaces hardcoded CorpNetPublic so deploys work without corporate VPN. Users can pass a CIDR or Azure service tag to restrict access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove add_nsg_corpnet_rule() in favor of Terraform nsg_allowed_source
- Add --allowed-ips flag to pass NSG source through to Terraform
- Add auto-install for kubectl, helm, and poetry (Linux/amd64)
- Add setup_aiopslab_mode_b(): verifies kubeconfig, generates aiopslab/config.yml, runs poetry install, prints summary table
- Add setup_aiopslab_mode_a() placeholder
- Remove emoji from log messages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document --mode B automatic setup, add tested-on platform note, add git worktree WSL submodule caveat, update Mode A instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove duplicate Terraform quick start, merge into single Azure Deployment section with deploy.py single-command workflow. Fix poetry shell references and add tested-on platform note. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The submodule path check incorrectly treated remote helm repo references (e.g. chaos-mesh/chaos-mesh) as local paths. Add remote_chart flag to Chaos Mesh config and fix upgrade method to respect the same flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   CLAUDE.md
#   poetry.lock
- Replace cluster_size with worker_vm_count in manual tf commands
- Replace poetry shell with eval $(poetry env activate)
- Add poetry env use python3.11 before poetry install
- Remove Documentation section (links to nonexistent files)
- Remove SECURITY.md references
- Replace --restrict-ssh-corpnet with --allowed-ips CorpNetPublic
- Remove deploy_old.py and DEPLOYMENT_GUIDE.md from file tree
- Remove "Migration from v1.0" section

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NSG rules default to open (*), not restricted to CorpNetPublic. Document --allowed-ips flag and nsg_allowed_source variable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Change kubeconfig copy mode from 0644 to 0600 on control plane
- Restore /etc/kubernetes/admin.conf to 0600 after fetch
- Set ~/.kube/config to 0600 on localhost after fetch
- Add no_log: true to kube_token and cert_hash debug tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard -f values_file with conditional check in helm upgrade
- Remove unused azurerm_public_ip_prefix resource from main.tf
- Remove unused location variable from variables.tf and tfvars example
- Add gen1/gen2 note to os_sku variable description
- Add trailing newlines to main.tf and variables.tf

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
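The `values_file` guard described above can be sketched as follows (the function and argument names are hypothetical, not taken from aiopslab/service/helm.py):

```python
def helm_upgrade_cmd(release, chart, values_file=None):
    """Build a helm upgrade argv, appending -f only when a values file is given.

    Without the guard, a None values_file would produce `-f None` and make
    helm fail with a confusing file-not-found error.
    """
    cmd = ["helm", "upgrade", release, chart]
    if values_file:
        cmd += ["-f", values_file]
    return cmd
```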
Download and verify kubectl binary checksum before installing. Add note about helm pipe-to-bash being the official install method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
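A minimal sketch of such checksum verification (the function name and chunk size are illustrative; the actual deploy.py implementation may differ):

```python
import hashlib


def sha256_matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a large binary (like kubectl) is never loaded
        # into memory all at once.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

Kubernetes publishes a `.sha256` file alongside each kubectl release binary, which is the natural source for `expected_hex`.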
The merge from main introduced AIOPSLAB_CLUSTER which hardcodes a kind- prefix on the context name, breaking remote/Azure clusters. Now reads k8s_host from config.yml: uses kind-kind for kind/localhost, default kubeconfig context for remote hosts, and AIOPSLAB_CLUSTER env var still works for parallel kind clusters. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When Prometheus isn't installed yet, Helm.status() raises RuntimeError. This is expected on first run -- log a warning instead of a full traceback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR introduces comprehensive automated deployment capabilities for AIOpsLab on Azure using Terraform and Ansible. It transforms the deployment process from manual VM setup to a single-command operation that provisions infrastructure, configures Kubernetes clusters, and sets up the AIOpsLab environment.
Changes:
- Single-command deployment orchestration via `deploy.py` with support for plan, apply, and destroy operations
- Dynamic Kubernetes cluster provisioning (1 controller + N workers) on Azure VMs with configurable sizing
- Automated Ansible playbook execution for Docker, Kubernetes, and CNI installation
- Enhanced application code to support remote clusters with proper kubeconfig context handling
- Comprehensive documentation covering deployment modes, troubleshooting, and security best practices
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| scripts/terraform/deploy.py | Main orchestrator script (new) - handles Terraform, Ansible, tool installation, and AIOpsLab configuration |
| scripts/terraform/generate_inventory.py | Generates Ansible inventory from Terraform outputs (new) |
| scripts/terraform/main.tf | Refactored infrastructure definition with dynamic worker nodes and proper NSG rules |
| scripts/terraform/variables.tf | Complete rewrite with new variables for VM size, count, SSH keys, and NSG configuration |
| scripts/terraform/outputs.tf | Restructured outputs with controller/worker details for automation |
| scripts/terraform/providers.tf | Updated provider versions and added skip_provider_registration guidance |
| scripts/terraform/terraform.tfvars.example | New configuration template with examples (new) |
| scripts/terraform/README.md | Comprehensive deployment guide with Mode A/B, troubleshooting, and cost management |
| scripts/terraform/ssh.tf | Removed (replaced with user-provided SSH keys) |
| scripts/terraform/data.tf | Moved to main.tf |
| scripts/ansible/setup_common.yml | Added conntrack package for kubeadm |
| scripts/ansible/remote_setup_controller_worker.yml | Major enhancements: public IP SAN support, proper permissions, idempotency, verification steps |
| scripts/ansible/inventory.yml.example | Updated with private_ip fields and improved documentation |
| aiopslab/service/kubectl.py | Fixed context selection to support non-kind clusters |
| aiopslab/service/helm.py | Added chart existence checks and values_file guard |
| aiopslab/service/telemetry/prometheus.py | Downgraded "release not found" from ERROR to WARNING |
| aiopslab/generators/fault/inject_symp.py | Added remote_chart flag for Chaos Mesh |
| README.md | Added Azure deployment option with Poetry installation guidance |
| TutorialSetup.md | Updated to use eval $(poetry env activate) |
| CLAUDE.md | Documented Azure deployment workflow and troubleshooting |
| .gitignore | Added terraform state files and sensitive configuration files |
```yaml
regexp: 'https://{{ control_plane_ip }}:6443'
replace: 'https://{{ control_plane_public_ip }}:6443'
```
The regexp replacement uses the private IP but should match any IP pattern. If the control_plane_ip format doesn't match exactly (e.g., has different formatting), the replacement will silently fail. Consider using a more robust pattern like 'regexp: "server: https://[^:]+:6443"' to match any IP or hostname.
Suggested change:

```diff
-regexp: 'https://{{ control_plane_ip }}:6443'
-replace: 'https://{{ control_plane_public_ip }}:6443'
+regexp: 'server: https://[^:]+:6443'
+replace: 'server: https://{{ control_plane_public_ip }}:6443'
```
```python
# Confirm destruction
confirm = input("This will destroy all resources. Type 'yes' to confirm: ")
if confirm.lower() != 'yes':
    logger.info("Destruction cancelled")
```
The destroy operation requires manual confirmation (line 675), which is good for safety. However, if the confirmation is not 'yes', the function returns False but logs 'Destruction cancelled' as INFO level. Consider using logger.warning() instead to make it more visible that the operation was intentionally cancelled, distinguishing it from an actual failure.
Suggested change:

```diff
-logger.info("Destruction cancelled")
+logger.warning("Destruction cancelled")
```
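The cancelled-destroy path is easier to exercise in tests if the decision is factored out of the `input()` call; a hedged sketch (the function name is hypothetical, not from deploy.py):

```python
import logging

logger = logging.getLogger("deploy")


def confirm_destroy(response):
    """Return True only when the user explicitly typed 'yes' (case-insensitive)."""
    if response.strip().lower() != "yes":
        # WARNING makes an intentional cancellation stand out from routine INFO logs.
        logger.warning("Destruction cancelled")
        return False
    return True
```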
```hcl
value = {
  name       = azurerm_linux_virtual_machine.controller.name
  public_ip  = azurerm_public_ip.controller.ip_address
  private_ip = azurerm_network_interface.controller.ip_configuration[0].private_ip_address
}
```
The private_ip field is referenced in generate_inventory.py (line 88) and remote_setup_controller_worker.yml (line 14), but it's accessed as an output of ip_configuration. This access pattern is correct for Azure, but consider adding a validation check in generate_inventory.py to ensure the private_ip is not empty before writing to the inventory.
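A minimal sketch of the validation the reviewer suggests for generate_inventory.py (the function name and node-dict shape are assumptions):

```python
import ipaddress


def validated_private_ip(node):
    """Fail fast if a Terraform output node is missing or has a malformed private_ip."""
    ip = node.get("private_ip")
    if not ip:
        raise ValueError(
            f"node {node.get('name', '?')} has no private_ip in Terraform output"
        )
    ipaddress.ip_address(ip)  # raises ValueError on malformed addresses
    return ip
```

Failing here, before the inventory file is written, gives a clearer error than a later Ansible connection failure against an empty host field.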
```python
ver = subprocess.run(
    [path, "--version"], capture_output=True, text=True, check=True
).stdout.strip()
minor = int(ver.split(".")[1])
if minor >= 11:
    return candidate, ver
```
The Python version detection logic parses the version string by splitting on '.' and accessing index [1] for the minor version. This will fail if Python returns a version string in an unexpected format or if the split doesn't produce enough elements. Add error handling or validation to prevent IndexError.
Suggested change:

```diff
-ver = subprocess.run(
-    [path, "--version"], capture_output=True, text=True, check=True
-).stdout.strip()
-minor = int(ver.split(".")[1])
-if minor >= 11:
-    return candidate, ver
+proc = subprocess.run(
+    [path, "--version"], capture_output=True, text=True, check=True
+)
+# Some Python versions print to stderr instead of stdout
+ver_output = (proc.stdout or proc.stderr or "").strip()
+match = re.search(r"Python\s+(\d+)\.(\d+)\.(\d+)", ver_output)
+if not match:
+    continue
+major = int(match.group(1))
+minor = int(match.group(2))
+# Accept Python 3.11+ and any higher major versions
+if major > 3 or (major == 3 and minor >= 11):
+    return candidate, ver_output
```

Note that the suggested code relies on `re`, so `import re` must also be added at the top of the module.
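The regex-based parsing in the suggestion is easy to unit-test once factored into standalone helpers; a sketch (names are hypothetical, not from deploy.py):

```python
import re


def parse_python_version(version_output):
    """Extract (major, minor, patch) from `python --version` output, or None."""
    match = re.search(r"Python\s+(\d+)\.(\d+)\.(\d+)", version_output)
    if not match:
        return None
    return tuple(int(g) for g in match.groups())


def is_supported(version_output):
    """Accept Python 3.11+ and any higher major versions."""
    parsed = parse_python_version(version_output)
    if parsed is None:
        return False
    major, minor, _ = parsed
    return major > 3 or (major == 3 and minor >= 11)
```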
```python
if cluster_env:
    context = f"kind-{cluster_env}"
elif k8s_host == "kind" or k8s_host == "localhost":
    context = "kind-kind"
```
The logic for context selection has a potential issue: if k8s_host is set to "localhost", it will use "kind-kind" context (line 30-31). However, for remote clusters accessed via localhost (like with port forwarding), this would incorrectly use the kind context. Consider checking if the context actually exists before using it, or refining the logic to distinguish between kind clusters and remote clusters accessed via localhost.
Suggested change:

```diff
+# Detect whether the default kind context ("kind-kind") exists in the current kubeconfig.
+kind_context_exists = False
+try:
+    contexts, _ = config.list_kube_config_contexts()
+    if contexts:
+        kind_context_exists = any(
+            ctx.get("name") == "kind-kind" for ctx in contexts
+        )
+except Exception:
+    # If kubeconfig can't be listed, fall back to using the default context.
+    kind_context_exists = False
 if cluster_env:
     context = f"kind-{cluster_env}"
 elif k8s_host == "kind" or k8s_host == "localhost":
-    context = "kind-kind"
+    # Only use the kind context if it actually exists; otherwise, use the default context.
+    context = "kind-kind" if kind_context_exists else None
```
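The resulting selection logic, factored into a pure function for testing (the function name and the None-means-kubeconfig-default convention are assumptions, not code from kubectl.py):

```python
def choose_context(cluster_env, k8s_host, kind_context_exists):
    """Pick a kubeconfig context name; None means 'use the kubeconfig default'."""
    if cluster_env:
        # Parallel kind clusters: AIOPSLAB_CLUSTER names the kind cluster.
        return f"kind-{cluster_env}"
    if k8s_host in ("kind", "localhost"):
        # Only prefer the kind context when it is actually present.
        return "kind-kind" if kind_context_exists else None
    # Remote hosts (e.g. an Azure controller): defer to the current context.
    return None
```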
```python
try:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    result = sock.connect_ex((host, port))
    sock.close()

    if result == 0:
        logger.info(f"SSH available on {host}")
        return True

except socket.error:
    pass

time.sleep(interval)
```
The socket connection check (line 172) uses connect_ex, which returns 0 on success. However, the socket is not closed in all cases: if settimeout or connect_ex raises before sock.close() is reached, the socket remains open. Consider using a 'with' context manager or putting sock.close() in a finally block.
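A sketch of the context-manager approach the comment recommends (the function name is hypothetical):

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Probe a TCP port; the with-block guarantees the socket is closed."""
    try:
        # socket objects support the context-manager protocol, so the
        # descriptor is released even if connect_ex raises.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            return sock.connect_ex((host, port)) == 0
    except socket.error:
        return False
```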
```diff
   # SSH access - restrict via var.nsg_allowed_source or --allowed-ips in deploy.py
   security_rule {
     name                       = "SSH"
-    priority                   = 1001
+    priority                   = 100
     direction                  = "Inbound"
     access                     = "Allow"
     protocol                   = "Tcp"
     source_port_range          = "*"
     destination_port_range     = "22"
-    source_address_prefix      = "*"
+    source_address_prefix      = var.nsg_allowed_source
     destination_address_prefix = "*"
   }
 }

 resource "azurerm_network_security_group" "aiopslab_nsg_2" {
   name                = "${var.resource_name_prefix}_aiopslabNSG_2"
   location            = var.resource_location
   resource_group_name = var.resource_group_name

   # Kubernetes API server - for remote kubectl access (Mode B)
   security_rule {
-    name                       = "SSH"
-    priority                   = 1001
+    name                       = "KubernetesAPI"
+    priority                   = 110
     direction                  = "Inbound"
     access                     = "Allow"
     protocol                   = "Tcp"
     source_port_range          = "*"
-    destination_port_range     = "22"
-    source_address_prefix      = "*"
+    destination_port_range     = "6443"
+    source_address_prefix      = var.nsg_allowed_source
     destination_address_prefix = "*"
```
The azurerm_network_security_group rules for SSH and Kubernetes API use var.nsg_allowed_source, which currently defaults to *, leaving ports 22 and 6443 exposed to the entire internet. An external attacker can scan these endpoints and attempt to exploit SSH or kube-apiserver vulnerabilities or stolen credentials to gain control of the VMs and cluster. Restrict source_address_prefix to specific CIDRs or Azure service tags by default (and require explicit opt-in for *) so that management and API ports are not globally accessible.
```hcl
description = "Source address prefix for NSG rules (SSH + K8s API). Use '*' for open access, a CIDR like '203.0.113.0/24', or an Azure service tag like 'CorpNetPublic'."
default     = "*"
```
The nsg_allowed_source variable defaults to *, which means that unless explicitly overridden, both SSH (22) and Kubernetes API (6443) NSG rules will allow inbound traffic from any IP address. This broad default significantly increases the attack surface by exposing management and control-plane services to internet-wide scanning and exploitation attempts. Use a safer default (such as a corporate CIDR or no default that forces the caller to supply a restricted prefix) and treat * only as an explicit, documented testing override.
Suggested change:

```diff
-description = "Source address prefix for NSG rules (SSH + K8s API). Use '*' for open access, a CIDR like '203.0.113.0/24', or an Azure service tag like 'CorpNetPublic'."
-default = "*"
+description = "Required: source address prefix for NSG rules (SSH + K8s API). Use a restricted CIDR like '203.0.113.0/24' or an Azure service tag like 'CorpNetPublic'. Use '*' only as an explicit, temporary testing override."
```
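On the deploy.py side, the `--allowed-ips` value could be sanity-checked before it reaches Terraform; a hedged sketch (the function name and the service-tag heuristic are assumptions, not existing code):

```python
import ipaddress
import re


def validate_allowed_source(value):
    """Sanity-check an NSG source prefix before handing it to Terraform."""
    if value == "*":
        # Explicit opt-in to internet-wide access; the caller should warn loudly.
        return value
    try:
        # Accepts a bare IP ('203.0.113.7') or a CIDR ('203.0.113.0/24').
        ipaddress.ip_network(value, strict=False)
        return value
    except ValueError:
        pass
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9.]*", value):
        # Looks like an Azure service tag such as 'CorpNetPublic' or 'AzureCloud.EastUS'.
        return value
    raise ValueError(f"not a CIDR, IP, service tag, or '*': {value!r}")
```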
```shell
# Install Docker, Kubernetes packages on all nodes
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml

# Initialize K8s cluster and join workers
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```
The documentation recommends running Ansible with ANSIBLE_HOST_KEY_CHECKING=False, which disables SSH host key verification for cluster provisioning. This allows a network attacker in the path between your machine and Azure to perform a man-in-the-middle attack on the SSH connections and inject or observe all provisioning commands and secrets. Remove ANSIBLE_HOST_KEY_CHECKING=False from the recommended commands and instead manage host keys via known_hosts or Ansible's ssh_known_hosts facilities so host authenticity is still verified.
Suggested change:

```diff
+# Ensure target host SSH keys are present in ~/.ssh/known_hosts
+# (e.g., via ssh-keyscan or Ansible's ssh_known_hosts module).
 # Install Docker, Kubernetes packages on all nodes
-ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
+ansible-playbook -i inventory.yml setup_common.yml
 # Initialize K8s cluster and join workers
-ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
+ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```
```shell
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```
The Azure deployment instructions suggest running Ansible with ANSIBLE_HOST_KEY_CHECKING=False, which turns off SSH host key verification for provisioning the Kubernetes cluster. With host key checks disabled, a network-positioned attacker can impersonate the VMs during Ansible runs and gain control over the cluster or capture credentials without being detected. Drop ANSIBLE_HOST_KEY_CHECKING=False from the example commands and rely on proper SSH host key management so users maintain protection against man-in-the-middle attacks.
Suggested change:

```diff
-ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml setup_common.yml
-ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
+ansible-playbook -i inventory.yml setup_common.yml
+ansible-playbook -i inventory.yml remote_setup_controller_worker.yml
```
Mode A runs AIOpsLab on a kubeadm controller VM where k8s_host is "localhost" but the kubeconfig context is kubernetes-admin@kubernetes, not kind-kind. Only treat k8s_host="kind" as a kind cluster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
setup_aiopslab.yml installs Python 3.11 (deadsnakes PPA), Poetry, Helm, delivers code via git clone or rsync (dev mode), adds user to docker group for VirtualizationFaultInjector, generates config.yml from Jinja2 template, and runs poetry install. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements setup_aiopslab_mode_a() which runs the Ansible playbook with extra-vars for clone vs rsync mode. Adds setup_only() for re-running setup without reprovisioning VMs. New CLI flags: --setup-only (mutually exclusive with --plan/--apply/--destroy), --dev (rsync local repo instead of git clone, Mode A only). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Adds fully automated deployment of Kubernetes clusters on Azure VMs using Terraform and Ansible, with a single-command `deploy.py` script.
Deployment automation (Terraform + Ansible)
- `deploy.py`: Single-command orchestrator that provisions Azure VMs, runs Ansible, installs local tools (kubectl, helm, poetry), and configures AIOpsLab -- supports `--plan`, `--apply`, and `--destroy`
- `--allowed-ips` / `nsg_allowed_source`, public IPs
- `generate_inventory.py`: Auto-generates Ansible inventory from Terraform outputs
- `no_log: true`, kubectl binary verified with SHA256 checksum
- `terraform.tfvars.example`: Documented config template with sizing examples
Documentation changes
- `scripts/terraform/README.md`: Full deployment guide with Mode A/B, VM sizing, troubleshooting, cost management
- `CLAUDE.md`: Added Azure deployment section with key files, config reference, common issues, architecture diagram
- `README.md` and `TutorialSetup.md`: Added Terraform/Ansible as cluster setup option
Other changes (not deployment-specific)
- `aiopslab/service/kubectl.py`: Fix kubeconfig context for non-kind clusters -- read `k8s_host` from config.yml, only use `kind-*` context for kind/localhost
- `aiopslab/service/helm.py`: Guard `-f values_file` in upgrade with `if values_file:` check; add `FileNotFoundError` with submodule hint for missing charts
- `aiopslab/service/telemetry/prometheus.py`: Downgrade "release not found" from ERROR+traceback to WARNING on first run
- `poetry.lock`: Refreshed lock file
- `.gitignore`: Add terraform state files, ansible retry files
Test plan
- `deploy.py --plan` dry-run succeeds with correct Terraform plan
- `deploy.py --apply --mode B` provisions VMs, runs Ansible, configures AIOpsLab
- `kubectl get nodes` shows all nodes Ready from laptop
- `python3 cli.py` + `start misconfig_app_hotel_res-detection-1` runs end-to-end (Prometheus, OpenEBS, app deploy, fault injection, workload)
- `deploy.py --destroy` cleans up all Azure resources

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>