feat: graceful SIGTERM/SIGINT worker drain (finish in-flight task before exit)#14

Open
adhikjoshi wants to merge 1 commit into main from feat/graceful-worker-drain


@adhikjoshi

Contributor

Problem

`run-workers` registers a SIGTERM handler that only sets a flag, but the worker loop runs `while True` and never checks it, and the worker threads are `daemon=True`. On SIGTERM the process therefore exits and kills the threads mid-task, orphaning the in-flight task (and losing its result when the task handler uploads synchronously). This is a root cause of serverless replicas being scaled down mid-generation and returning 404 for their output.

Change

  • Worker loop: while not self._shutdown_event.is_set() + blpop("ml_tasks", timeout=5) so shutdown is noticed within ~5s.
  • New ModelQ.shutdown(timeout=300): sets the event and join()s the worker threads so the current task finishes (and, for synchronous handlers, its upload completes) before exit.
  • CLI calls app_instance.shutdown() on SIGTERM/SIGINT.
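The pattern in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `DrainableWorkerPool`, `submit`, and `poll_timeout` are hypothetical names, an in-process `queue.Queue` stands in for the Redis `blpop("ml_tasks", timeout=5)` call, and the boolean return value and the single deadline shared across all joins are choices made for the sketch.

```python
import queue
import threading
import time


class DrainableWorkerPool:
    """Hypothetical stand-in for ModelQ; queue.Queue replaces Redis blpop."""

    def __init__(self, handler, num_workers=1, poll_timeout=0.05):
        self._handler = handler
        self._tasks = queue.Queue()
        self._poll_timeout = poll_timeout  # the PR uses a 5s blpop timeout
        self._shutdown_event = threading.Event()
        self._threads = [
            threading.Thread(target=self._worker_loop, daemon=True)
            for _ in range(num_workers)
        ]

    def start(self):
        for t in self._threads:
            t.start()

    def submit(self, task):
        self._tasks.put(task)

    def _worker_loop(self):
        # Re-check the event on every iteration instead of `while True`.
        while not self._shutdown_event.is_set():
            try:
                task = self._tasks.get(timeout=self._poll_timeout)
            except queue.Empty:
                continue  # poll timed out: loop back and re-check the event
            # The handler is never interrupted; it runs to completion
            # even if shutdown is requested while it is executing.
            self._handler(task)

    def shutdown(self, timeout=300):
        """Signal the workers to stop and wait for in-flight tasks."""
        self._shutdown_event.set()
        deadline = time.monotonic() + timeout
        for t in self._threads:
            t.join(max(0.0, deadline - time.monotonic()))
        # True if every worker exited before the deadline (sketch-only detail).
        return not any(t.is_alive() for t in self._threads)
```

With a bounded poll timeout, an idle worker notices the event within one poll interval, and a busy worker notices it as soon as its current task returns. Tasks still queued at shutdown are left in the queue for another worker to pick up.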

Existing behavior is unchanged during normal operation; only the shutdown path drains gracefully. Pairs with gpulab-v2 graceful docker stop on scale-down and the flux-klein-server synchronous-upload fix.

Version bump 1.0.14 → 1.0.15. Publishing the GitHub release triggers the PyPI publish (cd.yml).

… before exit)

Worker loops ran `while True` and ignored the shutdown flag, and the threads are
daemons — so on SIGTERM the process exited and killed them mid-task, orphaning the
in-flight task (and losing its result if the task handler uploaded synchronously).

Worker loops now run `while not self._shutdown_event.is_set()` and blpop with a 5s
timeout so shutdown is noticed promptly. New ModelQ.shutdown() sets the event and
joins the worker threads (up to a timeout), and the CLI calls it on SIGTERM/SIGINT
so an in-flight task — and its upload — completes before the process exits.
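
One way the SIGTERM/SIGINT wiring could look is sketched below. It is an assumption about the shape of the CLI change, not the PR's code: `install_drain_handlers` and `wait_and_drain` are hypothetical names, and `shutdown` is any zero-argument callable such as a bound `app_instance.shutdown`. The handlers only set a flag, and the blocking drain runs in the main thread after the wait returns.

```python
import signal
import threading


def install_drain_handlers(shutdown, signals=(signal.SIGTERM, signal.SIGINT)):
    """Register flag-setting handlers; return a function that waits, then drains.

    Must be called from the main thread, because signal.signal() requires it.
    """
    stop_requested = threading.Event()

    def _on_signal(signum, frame):
        # Keep the handler trivial: record the request and return.
        stop_requested.set()

    for sig in signals:
        signal.signal(sig, _on_signal)

    def wait_and_drain():
        # Wake up periodically so the interpreter can run pending handlers.
        while not stop_requested.wait(0.5):
            pass
        # Blocks until workers finish their in-flight task or the timeout hits.
        return shutdown()

    return wait_and_drain
```

A CLI entry point would start the workers, call `install_drain_handlers(app_instance.shutdown)`, and then block on the returned function, so the process exits only after the drain completes.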