feat(cell): add first-class TLS support for multigateway via cert-manager#462
feat(cell): add first-class TLS support for multigateway via cert-manager#462haritabh17 merged 7 commits intomainfrom
Conversation
…ager When MultigresClusterSpec.CertCommonName is set, the cell controller now creates a cert-manager Certificate (using supabase-issuer ClusterIssuer) and mounts the resulting TLS secret into multigateway pods with --pg-tls-cert-file and --pg-tls-key-file flags. This enables `psql sslmode=require` for HA projects without manual cert patching. The Certificate spec matches the non-HA project convention (RSA 2048, 44640h duration, generated-certs secret, literalSubject with Supabase org fields). TLS is fully gated on certCommonName so standalone multigres clusters without cert-manager continue to work unchanged. Closes MUL-341 Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
42f7c15 to
c19f170
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
c47f44a to
b383d2c
Compare
This comment has been minimized.
This comment has been minimized.
Verolop
left a comment
There was a problem hiding this comment.
nice work!
Before merging, we need to have one clear owner for certificate lifecycle. Right now the config comes from one scope and the resource is managed from another, which creates ambiguity during updates and deletion paths.
Move the cert-manager Certificate lifecycle from the cell controller to the MultigresCluster controller so there is exactly one reconciler and one ownerRef per certificate. Previously, every cell reconciled the same Certificate, causing ownerRef flapping between cells. The MultigresCluster now owns the Certificate, which means: - No multi-owner conflict when multiple cells share the same CN - Cleanup is deterministic: unsetting certCommonName deletes the Certificate (only if owned by this cluster) - Kubernetes GC properly cascades on cluster deletion Addresses review feedback on PR #462. Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
Move the cert-manager Certificate lifecycle from the cell controller to the MultigresCluster controller so there is exactly one reconciler and one ownerRef per certificate. Previously, every cell reconciled the same Certificate, causing ownerRef flapping between cells. The MultigresCluster now owns the Certificate, which means: - No multi-owner conflict when multiple cells share the same CN - Cleanup is deterministic: unsetting certCommonName deletes the Certificate (only if owned by this cluster) - Kubernetes GC properly cascades on cluster deletion Addresses review feedback on PR #462. Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
e158fa2 to
96f7459
Compare
This comment has been minimized.
This comment has been minimized.
…urrent map panic The fake client auto-registers unknown unstructured GVKs at runtime, which mutates the scheme's internal map. When parallel integration tests share a scheme and reconcileCertificate calls List(), this causes a concurrent map write panic. Pre-registering the cert-manager Certificate GVK in setupScheme() prevents the runtime mutation. Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
The RBAC marker was in the controller file which controller-gen doesn't scan (it only scans ./api/... and ./pkg/webhook/...). Moved the marker to multigrescluster_types.go so it's picked up, and added the delete verb needed by reconcileCertificate cleanup. Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
| ) error { | ||
| logger := log.FromContext(ctx) | ||
|
|
||
| if cluster.Spec.CertCommonName != "" { |
There was a problem hiding this comment.
one last thing: on certCommonName changes, reconcile should also remove the cert from the previous CN. Right now, since the cert name is the CN, the change creates a second Certificate instead of updating the existing one. The old cert is still owned by the same MultigresCluster, and ownerRef GC won’t remove it unless the cluster itself is deleted, so it can stick around and still target secretName: generated-certs.
There was a problem hiding this comment.
Ack, extracted deleteOwnedCertificates(ctx, cluster, keepName) which runs before the apply step on every reconcile.
When certCommonName changes, the old Certificate (named after the previous CN) was left behind, still targeting the same generated-certs secret. Now reconcileCertificate deletes any Certificates owned by this cluster whose name doesn't match the current CN before applying the new one. This covers both CN changes and CN removal. Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
- B1: Move CertSecretName to api/v1alpha1 as a single shared constant
so the Certificate secretName and Deployment volume can't silently
diverge.
- B2: Replace brittle string-matching in isNoMatchError with
errors.As(err, &meta.NoKindMatchError{}).
- B3: Check fc.Get errors with t.Fatal in the two-clusters test instead
of swallowing them (prevents nil-slice panic).
- N1: Update CertCommonName godoc in both CRD types to say "cluster
controller" instead of "cell controller".
Signed-off-by: Haritabh Gupta <20576107+haritabh17@users.noreply.github.com>
🔬 Go Test Coverage ReportSummary
Status✅ PASS DetailShow New Coverage |
Summary
certCommonNamefield toMultigresClusterSpecandCellSpecCRDsCertificate(usingsupabase-issuerClusterIssuer) whencertCommonNameis set--pg-tls-cert-file/--pg-tls-key-fileflagscertCommonName— clusters without cert-manager are unaffectedCloses MUL-341
Problem
The multigateway binary supports TLS via
--pg-tls-cert-file/--pg-tls-key-file(merged in multigres#659), but the operator wasn't creating certificates or passing these flags. This causedpsql sslmode=requireto fail with "server does not support SSL" for HA projects (MUL-326).Solution
When
certCommonNameis set on theMultigresClusterCR, the cell controller creates a cert-managerCertificatematching the non-HA project convention:supabase-issuer(ClusterIssuer)generated-certsdb.<ref>.<domain>+<ref>.<domain>The multigateway deployment then mounts the
generated-certssecret and passes the TLS flags. This supports all PostgreSQL SSL modes includingverify-full.The platform worker (separate repo) will pass
certCommonNamein theMultigresClusterCR — a 1-line change on their side.Test plan
TestBuildCertificate— Certificate resource name, DNS SANs, owner reference, issuer, key paramsTestReconcileCertificate— no-op when empty, creates via SSA, idempotent on repeated callsTestBuildMultiGatewayDeployment_TLS— TLS enabled (volume, mount, args) and disabled (no artifacts)certCommonName: db.<ref>.supabase.redand verifypsql sslmode=require