Description
Unikontainer.Kill() in pkg/unikontainers/unikontainers.go (lines 562–595) calls joinSandboxNetNs() without first locking the goroutine to its OS thread via runtime.LockOSThread(). This leaks the sandbox's TAP device, ingress qdisc, and tc redirect filters on every successful kill.
Root cause
joinSandboxNetNs performs unix.Setns(fd, CLONE_NEWNET), which mutates the network namespace of the calling OS thread only. In Go, a goroutine can be migrated between Ms (OS threads) at any scheduling point — every syscall entry/exit, channel op, or async preemption.
Between the setns and the netlink-backed network.CleanupAllUruncTaps() call later in Kill(), there are several scheduling points:
- hypervisors.NewVMM(...) (allocations)
- vmm.Stop(pid) → kill(2) syscall
- network.CleanupAllUruncTaps() → netlink.NewHandle() → socket(AF_NETLINK) + bind(2)
If the goroutine has migrated by the time netlink.NewHandle() runs, the netlink socket is bound to whatever netns that thread is in (the host netns, in practice), not the sandbox netns. The ^tap\d+_urunc$ scan finds zero matches (the sandbox TAP lives in the sandbox netns), the loop is a no-op, and Kill() returns nil while the sandbox TAP, qdisc, and tc redirect filters all remain.
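The thread affinity at issue can be illustrated without any namespaces. The following is a minimal sketch, assuming Linux (syscall.Gettid is Linux-only); tidsAcrossSchedulingPoints is a hypothetical helper, not part of urunc. It records the OS thread ID before and after a handful of scheduling points. With runtime.LockOSThread() the two IDs always match; without it the runtime gives no such guarantee.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// tidsAcrossSchedulingPoints (hypothetical helper) returns the OS thread ID
// observed before and after several scheduling points. When locked is true,
// the goroutine is pinned to its thread for the duration of the call.
func tidsAcrossSchedulingPoints(locked bool) (before, after int) {
	if locked {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
	}
	before = syscall.Gettid()
	for i := 0; i < 20; i++ {
		time.Sleep(time.Millisecond) // parks the goroutine; it may resume on another thread
		runtime.Gosched()            // explicit yield, another scheduling point
	}
	after = syscall.Gettid()
	return before, after
}

func main() {
	b, a := tidsAcrossSchedulingPoints(true)
	fmt.Println("locked:   same thread =", b == a)

	b, a = tidsAcrossSchedulingPoints(false)
	fmt.Println("unlocked: same thread =", b == a, "(not guaranteed either way)")
}
```

Because setns(2) changes state on the thread rather than on the goroutine, only the locked variant ensures that later calls run in the namespace that was joined.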
The contract was already documented on the function itself:
// joinSandboxNetns joins the network namespace of the sandbox
// This function should be called only from a locked thread
// (i.e. runtime.LockOSThread())
func (u Unikontainer) joinSandboxNetNs() error { ... }
Exec() at line 537 honors this contract by calling runtime.LockOSThread(); Kill() does not.
Realistic trigger
Every nerdctl rm -f, kubectl delete pod, k8s eviction, Knative scale-to-zero, and CRI StopContainer flows through Unikontainer.Kill(). The race is reachable whenever GOMAXPROCS > 1 (the default on multi-core hosts): Go's scheduler can move the goroutine to another OS thread around the kill(2) and socket(2) syscalls.
Impact
- TAP / tc-filter accumulation in long-lived sandbox netns: each Kill() leaks one TAP device + four tc rules (ingress qdisc on tap, ingress qdisc on eth0, redirect filter tap→eth0, redirect filter eth0→tap). On nodes that restart unikernel containers (Knative revision rolls, livenessProbe restarts, CronJob replays), this exhausts the netns and leaves stale tc rules redirecting traffic to deleted ifindexes, which manifests as "the unikernel's network stops working after a restart".
- Silent failure: CleanupAllUruncTaps() returns nil when no devices match, so no error surfaces to urunc kill or containerd. The leak stays invisible until the node degrades.
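The silent-failure mode can be sketched in isolation. The example below is an illustration only: it reuses the ^tap\d+_urunc$ pattern quoted above, but matchUruncTaps and cleanup are hypothetical stand-ins for the real scan and delete logic, and the interface names are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// uruncTapRe is the pattern quoted in the report.
var uruncTapRe = regexp.MustCompile(`^tap\d+_urunc$`)

// matchUruncTaps (hypothetical) returns the interface names that look like
// urunc TAP devices.
func matchUruncTaps(names []string) []string {
	var taps []string
	for _, n := range names {
		if uruncTapRe.MatchString(n) {
			taps = append(taps, n)
		}
	}
	return taps
}

// cleanup (hypothetical) mimics a cleanup loop that treats "nothing matched"
// as success, which is the behavior described above.
func cleanup(names []string) (deleted int, err error) {
	for range matchUruncTaps(names) {
		deleted++ // real code would delete the link and its tc rules here
	}
	return deleted, nil
}

func main() {
	sandbox := []string{"lo", "eth0", "tap0_urunc"} // view from the sandbox netns
	host := []string{"lo", "eth0", "cni0"}          // view from a thread left in the host netns

	n, err := cleanup(sandbox)
	fmt.Println(n, err) // prints "1 <nil>"

	n, err = cleanup(host)
	fmt.Println(n, err) // prints "0 <nil>"
}
```

Both calls return a nil error, so a caller cannot tell a successful cleanup from a scan that ran in the wrong namespace. Returning the match count, as this sketch does, is one possible way to let a caller notice the zero-match case.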
Reproduction
- Start a unikernel container:
nerdctl run -d --name uk1 --runtime io.containerd.urunc.v2 <unikernel-image>
PID=$(nerdctl inspect uk1 -f '{{.State.Pid}}')
nsenter -t $PID -n ip link show | grep urunc # tap0_urunc present
- Kill the container:
nerdctl kill uk1
- The sandbox netns persists (held by pause container in k8s, or by other refs). Re-enter and verify:
nsenter -t $PID -n ip link show
nsenter -t $PID -n tc qdisc show
The TAP device, ingress qdiscs, and tc redirect filters are still present.
- Loop to make accumulation visible:
for i in $(seq 50); do
nerdctl run -d --name uk$i --runtime io.containerd.urunc.v2 <image>
nerdctl kill uk$i
done
nsenter -t <sandbox-pid> -n ip link | grep urunc | wc -l # 50, expected 0
Proposed fix
Add runtime.LockOSThread() + defer runtime.UnlockOSThread() at the top of Kill(), and snapshot/restore the original netns so the locked OS thread is not handed back to the runtime pool while still pointing at the sandbox netns.
PR: #588
Related context
- CleanupAllUruncTaps to scan via regex but did not address thread-locking.
- LockOSThread predates and survived both.
Checklist