
Prevent prometheus_client deadlocks during signal handling by the gunicorn arbiter #610

Closed
suligap wants to merge 1 commit into master from prevent-prometheus-deadlocks

Conversation


@suligap suligap commented Dec 4, 2024

We have been affected by deadlocks caused by talisker instrumenting the gunicorn main process's logging with sentry and the related prometheus metrics. The deadlocks were on a lock in prometheus_client.

The gunicorn main process is single threaded. It was sometimes in the middle of incrementing a prometheus counter, as a result of sending an event to sentry, which in turn was triggered by the main process logging an error (e.g. about terminating an unresponsive worker). While the global multiprocess-mode prometheus client lock in values.py was held, a signal handler for SIGCHLD was invoked in that same process in response to some other worker terminating. During that signal handling the main process also logged an error message, which caused another sentry event and a corresponding prometheus counter update -> deadlock.
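
To make the failure mode concrete, here is a minimal illustrative sketch (not talisker or prometheus_client code) of the pattern: a signal handler re-enters a non-reentrant lock that the same single-threaded process already holds. Running it hangs by design:

```python
# Illustrative sketch only: a signal handler re-entering a non-reentrant
# lock deadlocks a single-threaded process.
import os
import signal
import threading

lock = threading.Lock()  # stands in for prometheus_client's lock in values.py

def on_sigchld(signum, frame):
    # Runs in the main thread, on top of a frame that already holds `lock`;
    # threading.Lock is not reentrant, so acquiring it again blocks forever.
    with lock:
        pass  # the real handler would end up in Counter.inc() here

signal.signal(signal.SIGCHLD, on_sigchld)

with lock:  # the arbiter's main flow holds the lock while updating a metric
    # Simulate SIGCHLD arriving mid-update -- running this hangs by design.
    os.kill(os.getpid(), signal.SIGCHLD)
```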

So, to be more conservative about what we do during signal handling, or in this case what we instrument gunicorn to do, this change sets up all of the request-related metrics so that they are not emitted from the process they were created in (which, for these metrics, is the gunicorn arbiter if we're running gunicorn).
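
A minimal sketch of the idea, using a hypothetical wrapper (this is not talisker's actual API): record the PID at metric-creation time and make updates from that process no-ops, so the arbiter's signal handlers never touch prometheus_client's lock:

```python
# Hypothetical sketch of the approach, not talisker's real implementation:
# a counter that is inert in the process that created it.
import os

from prometheus_client import Counter

class ArbiterSafeCounter:
    """Counter wrapper that never emits from its creating process."""

    def __init__(self, name, documentation):
        self._created_pid = os.getpid()  # the gunicorn arbiter, if created there
        self._counter = Counter(name, documentation)

    def inc(self, amount=1):
        if os.getpid() == self._created_pid:
            return  # skip emitting from the creating process
        self._counter.inc(amount)

requests = ArbiterSafeCounter("requests_total", "HTTP requests handled")
requests.inc()  # no-op in the arbiter; counts normally in forked workers
```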

More details on prometheus client behavior in
prometheus/client_python#1076


suligap commented Dec 4, 2024

Closing in favor of #611

@suligap suligap closed this Dec 4, 2024
@suligap suligap deleted the prevent-prometheus-deadlocks branch December 5, 2024 07:18