Commit 8af197a

Add KB articles for DNS resolution and permission error troubleshooting

Add two new troubleshooting articles based on customer support case:

1. DNS Resolution Issues with S3 Proxy in Private VPC Deployments
   - Documents issue where custom DHCP options exclude AWS DNS
   - Provides workarounds (DHCP options, DNS forwarding)
   - Includes debugging steps for ECS Exec, CloudWatch logs

2. JSON Encoding Error Masking Underlying Permission Issues
   - Documents how XML error responses from S3 can mask permission errors
   - Provides IAM debugging steps using CloudTrail and S3 access logs
   - Includes example bucket policies and common fixes
1 parent c2796af commit 8af197a

2 files changed

Lines changed: 411 additions & 0 deletions

trouble-dns-resolution-s3-proxy.md

Lines changed: 177 additions & 0 deletions
@@ -0,0 +1,177 @@
# DNS Resolution Issues with S3 Proxy in Private VPC Deployments

## Tags

`dns`, `s3-proxy`, `ecs`, `network`, `private-vpc`, `awsvpc`, `troubleshooting`

## Summary

When Quilt is deployed in a private VPC with custom DNS configuration, the S3 proxy service may fail to resolve internal hostnames, including the internal registry and AWS S3 endpoints. The s3-proxy container takes its DNS resolver from `/etc/resolv.conf`, which may not list the AWS-provided DNS server (`169.254.169.253` or the VPC+2 address) when custom DHCP options are configured.

---

## Symptoms

- **S3 proxy fails to connect to the internal registry**
  - Error: `could not resolve internal registry hostname`
  - Downloads from the Quilt catalog fail
  - Package operations may time out

- **S3 proxy cannot resolve AWS S3 endpoints**
  - Requests to S3 buckets fail
  - Error logs show DNS resolution failures in nginx

- **Observable indicators:**
  - ECS task logs show nginx resolver errors
  - `502 Bad Gateway` errors in the catalog
  - Package downloads consistently fail while other Quilt functionality works

- **Common environment:**
  - Private VPC with custom DHCP options
  - On-premises DNS servers configured
  - VPN or Direct Connect to on-premises infrastructure
  - AWS-provided DNS (169.254.169.253) not included in DHCP options

## Likely Causes

### 1. Custom DHCP Options Excluding AWS DNS

When a VPC's custom DHCP option set specifies on-premises DNS servers but omits the AWS DNS resolver, ECS tasks running in `awsvpc` network mode are not configured to use AWS DNS.

The Quilt S3 proxy service uses nginx, and its startup script takes the first `nameserver` entry from `/etc/resolv.conf`:

```bash
# From s3-proxy/run-nginx.sh
nameserver=$(awk '{if ($1 == "nameserver") { print $2; exit;}}' < /etc/resolv.conf)
```
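
The effect of that awk expression can be checked outside the container. The sketch below wraps the same expression in a function and runs it against a sample file; the function name `first_nameserver` and the addresses are made up for illustration. Because awk exits on the first match, only the first `nameserver` line is ever returned, so the order of entries in `/etc/resolv.conf` matters.

```bash
#!/usr/bin/env bash
# Same awk expression as run-nginx.sh, wrapped so it can read any file.
first_nameserver() {
    awk '{if ($1 == "nameserver") { print $2; exit;}}' < "$1"
}

# Hypothetical resolv.conf: on-premises DNS listed first, VPC resolver second.
sample=$(mktemp)
cat > "$sample" <<'EOF'
search corp.example.com
nameserver 10.1.1.10
nameserver 10.0.0.2
EOF

first_nameserver "$sample"   # prints 10.1.1.10; the second entry is never used
rm -f "$sample"
```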

If this nameserver cannot resolve either of the following, the S3 proxy will fail:

- Internal AWS hostnames (e.g., S3 VPC endpoint DNS names)
- Cloud Map service discovery names (e.g., `registry.${StackName}`)

### 2. VPC Endpoint Private DNS Not Resolving

Even with an S3 VPC endpoint configured, the endpoint's private DNS names won't resolve if the task's DNS resolver cannot reach AWS's DNS infrastructure.

### 3. Service Discovery (Cloud Map) DNS Failures

Quilt uses AWS Cloud Map for internal service discovery. The registry service registers as `registry.${AWS::StackName}` in a private DNS namespace. Resolving this name requires access to the Route 53 Resolver (AWS DNS).

## Recommendation

### Immediate Fix: Add AWS DNS to DHCP Options

1. **Include the AWS-provided DNS resolver** in your VPC's DHCP option set alongside your custom DNS servers. List it first, because the s3-proxy startup script uses only the first `nameserver` entry in `/etc/resolv.conf`:

   **Option A**: Add `169.254.169.253` (works for EC2 instances)

   **Option B**: Add your VPC's DNS address at `<VPC_CIDR_BASE>+2` (e.g., `10.0.0.2` for a `10.0.0.0/16` VPC)

2. **Create a new DHCP option set** in the AWS Console or via the CLI (an existing option set cannot be modified):

   ```bash
   aws ec2 create-dhcp-options \
     --dhcp-configurations \
     "Key=domain-name-servers,Values=10.0.0.2,YOUR_CUSTOM_DNS_1,YOUR_CUSTOM_DNS_2"
   ```

3. **Associate the new DHCP options** with your VPC and restart ECS tasks to pick up the new configuration.
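
The VPC+2 address from Option B can be derived from the CIDR with a little shell arithmetic. This is a minimal sketch: `vpc_dns_address` is a hypothetical helper, it assumes bash and the VPC's primary IPv4 CIDR, and the resource IDs in the commented commands are placeholders.

```bash
#!/usr/bin/env bash
# Print the VPC resolver address: the base of the VPC CIDR plus two.
vpc_dns_address() {
    local base=${1%/*}            # drop the prefix length: 10.0.0.0/16 -> 10.0.0.0
    local a b c d
    IFS=. read -r a b c d <<< "$base"
    local n=$(( (a << 24) + (b << 16) + (c << 8) + d + 2 ))
    printf '%d.%d.%d.%d\n' \
        $(( (n >> 24) & 255 )) $(( (n >> 16) & 255 )) \
        $(( (n >> 8) & 255 ))  $(( n & 255 ))
}

vpc_dns_address 10.0.0.0/16      # prints 10.0.0.2

# Associating the new option set and restarting tasks (placeholder IDs and names):
#   aws ec2 associate-dhcp-options --dhcp-options-id dopt-EXAMPLE --vpc-id YOUR_VPC_ID
#   aws ecs update-service --cluster YOUR_CLUSTER --service YOUR_SERVICE --force-new-deployment
```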

### Workaround: DNS Forwarding

If you cannot change the DHCP options, configure your on-premises DNS servers to forward queries for AWS domains to the AWS DNS resolver:

1. **Forward zones:**
   - `amazonaws.com`
   - `aws.amazon.com`
   - Your Cloud Map namespace (e.g., `your-stack-name`)

2. Configure conditional forwarding for these zones to the IP addresses of a Route 53 Resolver inbound endpoint.
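
How the forwarding rules look depends on your DNS software. As one illustration only, assuming the on-premises servers run BIND, the sketch below prints a forward-zone stanza for each zone in the list. `forward_zone` is a hypothetical helper, and `10.0.1.10` and `10.0.2.10` stand in for your inbound endpoint's IP addresses.

```bash
#!/usr/bin/env bash
# Emit a BIND forward-zone stanza. Usage: forward_zone ZONE RESOLVER_IP...
forward_zone() {
    local zone=$1; shift
    local forwarders="" ip
    for ip in "$@"; do
        forwarders+="$ip; "
    done
    printf 'zone "%s" {\n    type forward;\n    forward only;\n    forwarders { %s};\n};\n' \
        "$zone" "$forwarders"
}

# Placeholder inbound endpoint addresses; replace with your own.
for zone in amazonaws.com aws.amazon.com your-stack-name; do
    forward_zone "$zone" 10.0.1.10 10.0.2.10
done
```

Other DNS servers (for example Windows DNS or Unbound) have their own conditional-forwarder settings; the zone list and target addresses stay the same.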

### Future Enhancement Request

The customer has requested the ability to specify custom DNS servers as a CloudFormation parameter. This would involve adding `DnsServers` to the ECS task definitions:

```yaml
# Example of desired functionality
Parameters:
  CustomDnsServers:
    Type: CommaDelimitedList
    Default: ""
    Description: "Custom DNS servers for ECS tasks (optional)"
```

This enhancement is being tracked internally.

## Debugging Steps

### 1. Verify DNS in the running container

If ECS Exec is enabled, connect to the s3-proxy container:

```bash
aws ecs execute-command \
  --cluster YOUR_CLUSTER \
  --task TASK_ID \
  --container s3-proxy \
  --command "/bin/sh" \
  --interactive
```

Then check which resolver the container uses and whether it can resolve the registry and S3 hostnames:

```bash
# Show the resolver(s) the container received
cat /etc/resolv.conf
# Cloud Map name of the registry service
nslookup registry.YOUR_STACK_NAME
# Regional S3 endpoint (substitute your region)
nslookup s3.us-east-1.amazonaws.com
```

### 2. Check CloudWatch Logs

Look for DNS resolution errors in the s3-proxy log group:

```
/quilt/${StackName}/s3-proxy
```

Common error patterns:
- `[error] ... could not be resolved`
- `upstream timed out`
- `no resolver defined to resolve`
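
These patterns can be combined into a single filter. A minimal sketch, assuming log lines arrive on standard input: `filter_dns_errors` is a hypothetical helper, `STACK_NAME` is a variable you would set yourself, and the sample lines are made up rather than real nginx output.

```bash
#!/usr/bin/env bash
# Keep only lines that match one of the DNS-related patterns listed above.
filter_dns_errors() {
    grep -E 'could not be resolved|upstream timed out|no resolver defined to resolve'
}

# Made-up sample input; only the first line matches and is printed.
printf '%s\n' \
    'sample: registry.example could not be resolved' \
    'sample: GET / 200' | filter_dns_errors

# Against the real log group (needs AWS CLI v2 and credentials):
#   aws logs tail "/quilt/${STACK_NAME}/s3-proxy" --since 1h | filter_dns_errors
```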

### 3. Verify VPC DNS Settings

```bash
aws ec2 describe-vpc-attribute \
  --vpc-id YOUR_VPC_ID \
  --attribute enableDnsSupport

aws ec2 describe-vpc-attribute \
  --vpc-id YOUR_VPC_ID \
  --attribute enableDnsHostnames
```

Both attributes should report a `Value` of `true`.
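
Both checks can be scripted. The sketch below is one possible wrapper, not part of Quilt: `vpc_attribute` and `check_vpc_dns` are hypothetical helper names, and it assumes the AWS CLI is installed with credentials for the account.

```bash
#!/usr/bin/env bash
# Print one VPC attribute's value, lower-cased ("true" or "false").
vpc_attribute() {
    local vpc_id=$1 attribute=$2
    # The response key is the attribute name with a capital first letter,
    # e.g. enableDnsSupport -> EnableDnsSupport.
    local key="E${attribute#e}"
    aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" --attribute "$attribute" \
        --query "${key}.Value" --output text | tr '[:upper:]' '[:lower:]'
}

# Report both DNS attributes; return non-zero if either is not true.
check_vpc_dns() {
    local vpc_id=$1 attribute value status=0
    for attribute in enableDnsSupport enableDnsHostnames; do
        value=$(vpc_attribute "$vpc_id" "$attribute")
        echo "${attribute}=${value}"
        [ "$value" = "true" ] || status=1
    done
    return "$status"
}

# Usage: check_vpc_dns YOUR_VPC_ID
```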

### 4. Check DHCP Options

```bash
aws ec2 describe-dhcp-options \
  --dhcp-options-ids "$(aws ec2 describe-vpcs --vpc-ids YOUR_VPC_ID \
    --query 'Vpcs[0].DhcpOptionsId' --output text)"
```

Verify that `domain-name-servers` includes an AWS DNS resolver (`AmazonProvidedDNS`, `169.254.169.253`, or the VPC+2 address).
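
That check can be expressed as a small function. A minimal sketch with a hypothetical helper, `includes_aws_dns`, which takes the VPC+2 address followed by the configured servers; the sample addresses are made up.

```bash
#!/usr/bin/env bash
# Succeed if any listed server is an AWS resolver.
# Usage: includes_aws_dns VPC_PLUS_TWO SERVER...
includes_aws_dns() {
    local vpc_dns=$1; shift
    local server
    for server in "$@"; do
        case "$server" in
            AmazonProvidedDNS|169.254.169.253|"$vpc_dns") return 0 ;;
        esac
    done
    return 1
}

# Hypothetical option set that lists only on-premises servers.
if includes_aws_dns 10.0.0.2 10.1.1.10 10.1.1.11; then
    echo "AWS DNS present"
else
    echo "AWS DNS missing"       # this branch runs for the sample values
fi

# Feeding it real values (DHCP_OPTIONS_ID is a placeholder):
#   servers=$(aws ec2 describe-dhcp-options --dhcp-options-ids DHCP_OPTIONS_ID \
#     --query "DhcpOptions[0].DhcpConfigurations[?Key=='domain-name-servers'].Values[].Value" \
#     --output text)
#   includes_aws_dns 10.0.0.2 $servers
```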

## Related Issues

- [AWS Documentation: DNS attributes for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html)
- [AWS Documentation: DHCP options sets](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_DHCP_Options.html)
- [ECS Task Networking with awsvpc mode](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-networking.html)

## See Also

- JSON Encoding Error Hiding Permission Issues (related KB article)
- Private VPC Deployment Best Practices
