fix: Kill() races on netns — missing runtime.LockOSThread() leaks sandbox TAP and tc filters #587

Description

@mn-ram

Unikontainer.Kill() in pkg/unikontainers/unikontainers.go (lines 562–595) calls joinSandboxNetNs() without first locking the goroutine to its OS thread via runtime.LockOSThread(). This leaks the sandbox's TAP device, ingress qdisc, and tc redirect filters on every successful kill.

Root cause

joinSandboxNetNs performs unix.Setns(fd, CLONE_NEWNET), which mutates the network namespace of the calling OS thread only. In Go, a goroutine can be migrated between Ms (OS threads) at any scheduling point — every syscall entry/exit, channel op, or async preemption.
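As a rough, Linux-only illustration of the thread affinity involved (this is not code from urunc, and sampleTIDs and distinct are names made up for the sketch), the snippet below samples the kernel thread ID seen by one goroutine across several scheduling points. With runtime.LockOSThread() held, the IDs are guaranteed to be identical. Without it, the runtime makes no such promise.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// sampleTIDs (hypothetical helper) records the kernel thread ID observed by
// the calling goroutine across n scheduling points. When lock is true the
// goroutine is pinned to its OS thread for the duration of the sampling.
func sampleTIDs(n int, lock bool) []int {
	if lock {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
	}
	tids := make([]int, 0, n)
	for i := 0; i < n; i++ {
		tids = append(tids, syscall.Gettid())
		// Sleeping parks the goroutine; an unlocked goroutine may be
		// resumed on a different OS thread afterwards.
		time.Sleep(time.Millisecond)
	}
	return tids
}

// distinct counts how many different thread IDs appear in tids.
func distinct(tids []int) int {
	seen := make(map[int]struct{})
	for _, t := range tids {
		seen[t] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println("distinct threads, locked:  ", distinct(sampleTIDs(50, true)))
	fmt.Println("distinct threads, unlocked:", distinct(sampleTIDs(50, false)))
}
```

The unlocked count varies from run to run and may well be 1 on a quiet machine. The point is only that nothing forces it to be 1, so per-thread state such as the netns set by setns cannot be relied on.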

Between the setns and the netlink-backed network.CleanupAllUruncTaps() call later in Kill(), there are several scheduling points:

  • hypervisors.NewVMM(...) (allocations)
  • vmm.Stop(pid) → kill(2) syscall
  • network.CleanupAllUruncTaps() → netlink.NewHandle() → socket(AF_NETLINK) + bind(2)

If the goroutine has migrated by the time netlink.NewHandle() runs, the netlink socket is bound to whatever netns that thread is in (the host netns, in practice), not the sandbox netns. The ^tap\d+_urunc$ scan finds zero matches (the sandbox TAP lives in the sandbox netns), the loop is a no-op, and Kill() returns nil while the sandbox TAP, qdisc, and tc redirect filters all remain.
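To make the silent no-op concrete, here is a minimal sketch (not the urunc implementation). cleanupTaps, the del callback, and the interface lists are hypothetical stand-ins; only the ^tap\d+_urunc$ pattern comes from the description above.

```go
package main

import (
	"fmt"
	"regexp"
)

// uruncTapRe is the interface-name pattern quoted in this issue.
var uruncTapRe = regexp.MustCompile(`^tap\d+_urunc$`)

// cleanupTaps is a hypothetical stand-in for the cleanup loop: it calls del
// for every link whose name matches the pattern and reports how many links
// were removed. With zero matches it returns (0, nil), i.e. "success".
func cleanupTaps(links []string, del func(name string) error) (int, error) {
	removed := 0
	for _, name := range links {
		if !uruncTapRe.MatchString(name) {
			continue
		}
		if err := del(name); err != nil {
			return removed, err
		}
		removed++
	}
	return removed, nil
}

func main() {
	del := func(name string) error {
		fmt.Println("deleting", name)
		return nil
	}

	// Assumed link names, for illustration only.
	hostLinks := []string{"lo", "eth0", "docker0"}
	sandboxLinks := []string{"lo", "eth0", "tap0_urunc"}

	n, err := cleanupTaps(hostLinks, del)
	fmt.Println("host netns view:   ", n, err)

	n, err = cleanupTaps(sandboxLinks, del)
	fmt.Println("sandbox netns view:", n, err)
}
```

When the scan sees the host's links, nothing matches and the function still reports success, which is why the caller cannot tell a clean sandbox from a scan of the wrong namespace.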

The contract was already documented on the function itself:

// joinSandboxNetns joins the network namespace of the sandbox
// This function should be called only from a locked thread
// (i.e. runtime.LockOSThread())
func (u Unikontainer) joinSandboxNetNs() error { ... }

Exec() at line 537 obeys this with runtime.LockOSThread(). Kill() simply forgot to.

Realistic trigger

Every nerdctl rm -f, kubectl delete pod, k8s eviction, Knative scale-to-zero, and CRI StopContainer flows through Unikontainer.Kill(). The race is reachable whenever GOMAXPROCS > 1 (the default on multi-core machines), because Go's scheduler may move the goroutine to another thread around the kill(2) and socket(2) syscalls.

Impact

  • TAP / tc-filter accumulation in long-lived sandbox netns: each Kill() leaks one TAP device + four tc rules (ingress qdisc on tap, ingress qdisc on eth0, redirect filter tap→eth0, redirect filter eth0→tap). On nodes that restart unikernel containers (Knative revision rolls, livenessProbe restarts, CronJob replays), this exhausts the netns and leaves stale tc rules redirecting traffic to deleted ifindexes — manifesting as "the unikernel's network stops working after a restart".
  • Silent failure: CleanupAllUruncTaps() returns nil when no devices match, so no error surfaces to urunc kill or containerd. The leak stays invisible until the node degrades.

Reproduction

  1. Start a unikernel container:
    nerdctl run -d --name uk1 --runtime io.containerd.urunc.v2 <unikernel-image>
    PID=$(nerdctl inspect uk1 -f '{{.State.Pid}}')
    nsenter -t $PID -n ip link show | grep urunc   # tap0_urunc present
    
  2. Kill the container:
    nerdctl kill uk1
    
  3. The sandbox netns persists (held by pause container in k8s, or by other refs). Re-enter and verify:
    nsenter -t $PID -n ip link show
    nsenter -t $PID -n tc qdisc show
    
    The TAP device, ingress qdiscs, and tc redirect filters are still present.
  4. Loop to make accumulation visible:
    for i in $(seq 50); do
      nerdctl run -d --name uk$i --runtime io.containerd.urunc.v2 <image>
      nerdctl kill uk$i
    done
    nsenter -t <sandbox-pid> -n ip link | grep urunc | wc -l   # 50, expected 0
    

Proposed fix

Add runtime.LockOSThread() + defer runtime.UnlockOSThread() at the top of Kill(), and snapshot/restore the original netns so the locked OS thread is not handed back to the runtime pool while still pointing at the sandbox netns.
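The shape of that fix could look roughly like the sketch below. It is not the code in the PR: netnsOps and withSandboxNetns are hypothetical names, and the namespace calls are passed in as function hooks so the control flow can run without privileges. In real code, Save would presumably open the thread's current netns and return a closure that calls setns back to it, and Join would wrap joinSandboxNetNs().

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
)

// netnsOps is a hypothetical seam over the namespace syscalls.
type netnsOps struct {
	// Save snapshots the current netns and returns a function restoring it.
	Save func() (restore func() error, err error)
	// Join switches the calling OS thread into the sandbox netns.
	Join func() error
}

// withSandboxNetns pins the goroutine to its OS thread, joins the sandbox
// netns, runs body, and restores the original netns before unlocking.
func withSandboxNetns(ops netnsOps, body func() error) (err error) {
	runtime.LockOSThread()

	restore, serr := ops.Save()
	if serr != nil {
		runtime.UnlockOSThread() // netns untouched, safe to release the thread
		return fmt.Errorf("snapshot netns: %w", serr)
	}
	defer func() {
		if rerr := restore(); rerr != nil {
			// The thread may still be in the sandbox netns. Keep it locked
			// so the runtime never schedules other goroutines onto it.
			err = errors.Join(err, fmt.Errorf("restore netns: %w", rerr))
			return
		}
		runtime.UnlockOSThread()
	}()

	if err = ops.Join(); err != nil {
		return fmt.Errorf("join sandbox netns: %w", err)
	}
	return body()
}

func main() {
	// Fake hooks that only log, to show the order of operations.
	ops := netnsOps{
		Save: func() (func() error, error) {
			fmt.Println("save original netns")
			return func() error { fmt.Println("restore original netns"); return nil }, nil
		},
		Join: func() error { fmt.Println("join sandbox netns"); return nil },
	}
	err := withSandboxNetns(ops, func() error {
		fmt.Println("stop VMM, clean up TAPs and tc filters")
		return nil
	})
	fmt.Println("err:", err)
}
```

Leaving the thread locked when the restore fails is a deliberate choice in this sketch: a goroutine that exits while locked causes the Go runtime to terminate the thread rather than reuse it, so a thread stuck in the wrong netns never returns to the pool.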

PR: #588
