| 1 | +# DNS Resolution Issues with S3 Proxy in Private VPC Deployments |
| 2 | + |
| 3 | +## Tags |
| 4 | + |
| 5 | +`dns`, `s3-proxy`, `ecs`, `network`, `private-vpc`, `awsvpc`, `troubleshooting` |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +When Quilt is deployed in a private VPC with custom DNS configuration, the S3 proxy service may fail to resolve internal hostnames (including the internal registry and AWS S3 endpoints). The s3-proxy container takes its DNS resolver from the first `nameserver` entry in `/etc/resolv.conf`, which may not be the AWS-provided DNS server (`169.254.169.253` or the VPC base address plus two) when a custom DHCP option set is configured. |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Symptoms |
| 14 | + |
| 15 | +- **S3 proxy fails to connect to the internal registry** |
| 16 | + - Error: `could not resolve internal registry hostname` |
| 17 | + - Downloads from the Quilt catalog fail |
| 18 | + - Package operations may time out |
| 19 | + |
| 20 | +- **S3 proxy cannot resolve AWS S3 endpoints** |
| 21 | + - Requests to S3 buckets fail |
| 22 | + - Error logs show DNS resolution failures in nginx |
| 23 | + |
| 24 | +- **Observable indicators:** |
| 25 | + - ECS task logs show nginx resolver errors |
| 26 | + - `502 Bad Gateway` errors in the catalog |
| 27 | + - Package downloads consistently fail while other Quilt functionality works |
| 28 | + |
| 29 | +- **Common environment:** |
| 30 | + - Private VPC with custom DHCP options |
| 31 | + - On-premises DNS servers configured |
| 32 | + - VPN or Direct Connect to on-premises infrastructure |
| 33 | + - AWS-provided DNS (169.254.169.253) not included in DHCP options |
| 34 | + |
| 35 | +## Likely Causes |
| 36 | + |
| 37 | +### 1. Custom DHCP Options Excluding AWS DNS |
| 38 | + |
| 39 | +When a VPC uses a custom DHCP option set that lists on-premises DNS servers without the AWS DNS resolver, ECS tasks running in `awsvpc` network mode receive only those servers and never query AWS's DNS. |
| 40 | + |
| 41 | +The Quilt S3 proxy service uses nginx. Its startup script takes the first `nameserver` entry from `/etc/resolv.conf`: |
| 42 | + |
| 43 | +```bash |
| 44 | +# From s3-proxy/run-nginx.sh |
| 45 | +nameserver=$(awk '{if ($1 == "nameserver") { print $2; exit;}}' < /etc/resolv.conf) |
| 46 | +``` |
| 47 | + |
| 48 | +The S3 proxy fails if this nameserver cannot resolve either of the following: |
| 49 | +- Internal AWS hostnames (e.g., S3 VPC endpoint DNS names) |
| 50 | +- Cloud Map service discovery names (e.g., `registry.${StackName}`) |
| 53 | + |
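Because the `awk` expression exits after its first match, only the first `nameserver` entry is ever used. The sketch below runs the same expression against a sample file to illustrate this. The function name `first_nameserver`, the search domain, and all addresses are hypothetical placeholders, not part of the Quilt source.

```bash
# Hypothetical wrapper around the awk expression from run-nginx.sh,
# parameterized so it can read any file instead of /etc/resolv.conf.
first_nameserver() {
  awk '{if ($1 == "nameserver") { print $2; exit;}}' < "$1"
}

# Build a sample resolv.conf with an on-premises server listed first.
# All addresses and the search domain are placeholders.
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
search corp.example.com
nameserver 192.168.1.10
nameserver 10.0.0.2
EOF

# Prints 192.168.1.10: awk exits on the first match, so 10.0.0.2 is ignored.
first_nameserver "$sample_conf"
rm -f "$sample_conf"
```

In this sample, the AWS resolver at `10.0.0.2` is present but never selected, which is why the order of DNS servers matters as much as their presence.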
| 54 | +### 2. VPC Endpoint Private DNS Not Resolving |
| 55 | + |
| 56 | +Even with an S3 VPC endpoint configured, if the task's DNS resolver cannot reach AWS's DNS infrastructure, private DNS names for the endpoint won't resolve. |
| 57 | + |
| 58 | +### 3. Service Discovery (Cloud Map) DNS Failures |
| 59 | + |
| 60 | +Quilt uses AWS Cloud Map for internal service discovery. The registry service registers as `registry.${AWS::StackName}` in a private DNS namespace. Resolving this name requires access to the Route 53 Resolver (AWS DNS). |
| 61 | + |
| 62 | +## Recommendation |
| 63 | + |
| 64 | +### Immediate Fix: Add AWS DNS to DHCP Options |
| 65 | + |
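| 66 | +1. **Define a new DHCP option set for your VPC** that lists the AWS-provided DNS resolver ahead of your custom DNS servers. An existing DHCP option set cannot be edited after creation, and the resolver should come first because the s3-proxy startup script uses only the first `nameserver` entry: |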
| 67 | + |
| 68 | + **Option A**: Add `169.254.169.253` (works for EC2 instances) |
| 69 | + |
| 70 | + **Option B**: Add your VPC's DNS address at `<VPC_CIDR_BASE>+2` (e.g., `10.0.0.2` for a `10.0.0.0/16` VPC) |
| 71 | + |
| 72 | +2. **Create the new DHCP option set** in the AWS Console or via the CLI: |
| 73 | + |
| 74 | + ```bash |
| 75 | + aws ec2 create-dhcp-options \ |
| 76 | + --dhcp-configurations \ |
| 77 | + "Key=domain-name-servers,Values=10.0.0.2,YOUR_CUSTOM_DNS_1,YOUR_CUSTOM_DNS_2" |
| 78 | + ``` |
| 79 | + |
| 80 | +3. **Associate the new DHCP option set** with your VPC, then restart the ECS tasks so that they pick up the new configuration. |
| 81 | + |
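The three steps above can be sketched as a small script. This is a minimal sketch under stated assumptions: the helper names (`vpc_resolver_ip`, `dns_server_list`, `apply_dhcp_options`) are hypothetical, the VPC+2 address is derived from the VPC's primary CIDR, and the ECS service name for the S3 proxy must be looked up in your own stack. Only the `aws` subcommands themselves are real CLI calls.

```bash
# Hypothetical helper: derive the VPC+2 resolver address from the primary CIDR.
vpc_resolver_ip() {
  cidr_base=${1%/*}          # strip the prefix length, e.g. 10.0.0.0
  last_octet=${cidr_base##*.}
  first_three=${cidr_base%.*}
  echo "${first_three}.$((last_octet + 2))"
}

# Hypothetical helper: join the resolver (first) and custom servers with commas.
dns_server_list() {
  joined=$1
  shift
  for server in "$@"; do
    joined="$joined,$server"
  done
  echo "$joined"
}

# Create a new option set, associate it, and force the service to restart.
# Usage: apply_dhcp_options VPC_ID VPC_CIDR CLUSTER SERVICE CUSTOM_DNS...
apply_dhcp_options() {
  [ "$#" -ge 5 ] || { echo "usage: apply_dhcp_options VPC_ID VPC_CIDR CLUSTER SERVICE DNS..." >&2; return 2; }
  vpc_id=$1; vpc_cidr=$2; cluster=$3; service=$4
  shift 4
  servers=$(dns_server_list "$(vpc_resolver_ip "$vpc_cidr")" "$@")

  # DHCP option sets are immutable, so a new one is always created.
  dhcp_id=$(aws ec2 create-dhcp-options \
    --dhcp-configurations "Key=domain-name-servers,Values=$servers" \
    --query 'DhcpOptions.DhcpOptionsId' --output text) || return 1

  aws ec2 associate-dhcp-options \
    --dhcp-options-id "$dhcp_id" --vpc-id "$vpc_id" || return 1

  # Replace running tasks so new ones start with the updated resolv.conf.
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --force-new-deployment > /dev/null
}
```

A call might look like `apply_dhcp_options YOUR_VPC_ID 10.0.0.0/16 YOUR_CLUSTER YOUR_S3_PROXY_SERVICE YOUR_CUSTOM_DNS_1 YOUR_CUSTOM_DNS_2`. Nothing runs against AWS until `apply_dhcp_options` is invoked, so the helpers can be reviewed or tested on their own first.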
| 82 | +### Workaround: DNS Forwarding |
| 83 | + |
| 84 | +If you cannot modify DHCP options, configure your on-premises DNS servers to forward queries for AWS domains to the AWS DNS resolver: |
| 85 | + |
| 86 | +1. **Forward zones:** |
| 87 | + - `amazonaws.com` |
| 88 | + - `aws.amazon.com` |
| 89 | + - Your Cloud Map namespace (e.g., `your-stack-name`) |
| 90 | + |
| 91 | +2. Point the conditional forwarders for these zones at the IP addresses of a Route 53 Resolver inbound endpoint in the VPC. |
| 92 | + |
| 93 | +### Future Enhancement Request |
| 94 | + |
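| 95 | +The customer has requested the ability to specify custom DNS servers as a CloudFormation parameter. This would involve adding `DnsServers` to the container definitions in the ECS task definitions. Note that the ECS documentation lists this setting as unsupported for tasks using the `awsvpc` network mode, so the approach would need to be validated first: |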
| 96 | + |
| 97 | +```yaml |
| 98 | +# Example of desired functionality |
| 99 | +Parameters: |
| 100 | + CustomDnsServers: |
| 101 | + Type: CommaDelimitedList |
| 102 | + Default: "" |
| 103 | + Description: "Custom DNS servers for ECS tasks (optional)" |
| 104 | +``` |
| 105 | + |
| 106 | +This enhancement is being tracked internally. |
| 107 | + |
| 108 | +## Debugging Steps |
| 109 | + |
| 110 | +### 1. Verify DNS in the running container |
| 111 | + |
| 112 | +If ECS Exec is enabled, connect to the s3-proxy container: |
| 113 | + |
| 114 | +```bash |
| 115 | +aws ecs execute-command \ |
| 116 | + --cluster YOUR_CLUSTER \ |
| 117 | + --task TASK_ID \ |
| 118 | + --container s3-proxy \ |
| 119 | + --command "/bin/sh" \ |
| 120 | + --interactive |
| 121 | +``` |
| 122 | + |
| 123 | +Then check: |
| 124 | + |
| 125 | +```bash |
| 126 | +cat /etc/resolv.conf |
| 127 | +nslookup registry.YOUR_STACK_NAME |
| 128 | +nslookup s3.us-east-1.amazonaws.com |
| 129 | +``` |
| 130 | + |
| 131 | +### 2. Check CloudWatch Logs |
| 132 | + |
| 133 | +Look for DNS resolution errors in the s3-proxy log group: |
| 134 | + |
| 135 | +``` |
| 136 | +/quilt/${StackName}/s3-proxy |
| 137 | +``` |
| 138 | + |
| 139 | +Common error patterns: |
| 140 | +- `[error] ... could not be resolved` |
| 141 | +- `upstream timed out` |
| 142 | +- `no resolver defined to resolve` |
| 143 | + |
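The error patterns above can be pulled out of a log stream with a simple filter. This is a minimal sketch: the function name `filter_dns_errors` is hypothetical, and the usage line assumes AWS CLI v2 (which provides `aws logs tail`) and valid credentials.

```bash
# Hypothetical helper: keep only lines that match the error patterns listed above.
# Reads log lines on stdin and writes the matching lines to stdout.
filter_dns_errors() {
  grep -E 'could not be resolved|upstream timed out|no resolver defined to resolve'
}

# Example usage (stack name is a placeholder):
#   aws logs tail "/quilt/YOUR_STACK_NAME/s3-proxy" --since 1h | filter_dns_errors
```

The function exits with a non-zero status when nothing matches, which makes it usable in a quick check such as `... | filter_dns_errors || echo "no DNS errors found"`.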
| 144 | +### 3. Verify VPC DNS Settings |
| 145 | + |
| 146 | +```bash |
| 147 | +aws ec2 describe-vpc-attribute \ |
| 148 | + --vpc-id YOUR_VPC_ID \ |
| 149 | + --attribute enableDnsSupport |
| 150 | + |
| 151 | +aws ec2 describe-vpc-attribute \ |
| 152 | + --vpc-id YOUR_VPC_ID \ |
| 153 | + --attribute enableDnsHostnames |
| 154 | +``` |
| 155 | + |
| 156 | +Both should return `true`. |
| 157 | + |
| 158 | +### 4. Check DHCP Options |
| 159 | + |
| 160 | +```bash |
| 161 | +aws ec2 describe-dhcp-options \ |
| 162 | + --dhcp-options-ids $(aws ec2 describe-vpcs --vpc-ids YOUR_VPC_ID \ |
| 163 | + --query 'Vpcs[0].DhcpOptionsId' --output text) |
| 164 | +``` |
| 165 | + |
| 166 | +Verify that `domain-name-servers` includes an AWS DNS resolver (`AmazonProvidedDNS`, `169.254.169.253`, or the VPC+2 address), ideally as the first entry. |
| 167 | + |
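The check on `domain-name-servers` can also be scripted. The sketch below reports the position of the first AWS resolver in a list of DNS servers. The function name `aws_resolver_position` is hypothetical, and the `--query` expression in the usage comment is an assumption about the shape of the `describe-dhcp-options` output that should be verified against your CLI version.

```bash
# Hypothetical helper: print the 1-based position of the first AWS resolver
# in a list of DNS servers, or return non-zero if none is present.
# Usage: aws_resolver_position VPC_PLUS_TWO_IP SERVER...
aws_resolver_position() {
  expected=$1
  shift
  position=1
  for server in "$@"; do
    case $server in
      AmazonProvidedDNS|169.254.169.253|"$expected")
        echo "$position"
        return 0
        ;;
    esac
    position=$((position + 1))
  done
  return 1
}

# Example usage (IDs and addresses are placeholders):
#   servers=$(aws ec2 describe-dhcp-options --dhcp-options-ids YOUR_DHCP_OPTIONS_ID \
#     --query "DhcpOptions[0].DhcpConfigurations[?Key=='domain-name-servers'].Values[].Value" \
#     --output text)
#   aws_resolver_position 10.0.0.2 $servers || echo "AWS resolver missing"
```

A result of `1` means the AWS resolver is listed first. A higher number means it is present but not first, which may still cause failures given that the s3-proxy startup script uses only the first `nameserver` entry.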
| 168 | +## Related Issues |
| 169 | + |
| 170 | +- [AWS Documentation: DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html) |
| 171 | +- [AWS Documentation: DHCP options sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html) |
| 172 | +- [ECS Task Networking with awsvpc mode](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html) |
| 173 | + |
| 174 | +## See Also |
| 175 | + |
| 176 | +- JSON Encoding Error Hiding Permission Issues (related KB article) |
| 177 | +- Private VPC Deployment Best Practices |