Prevent prometheus_client deadlocks during signal handling by the gunicorn arbiter #610
Closed
Conversation
We have been affected by deadlocks caused by talisker instrumenting the gunicorn main process's logging with sentry and the related prometheus metrics. The deadlocks were on a lock in prometheus_client.

This is a single-threaded context in which the main process was sometimes in the middle of incrementing a prometheus counter, as a result of sending an event to sentry, which in turn was triggered by the gunicorn main process logging an error (e.g. about terminating an unresponsive worker). While the global multiprocessing-mode prometheus client lock in values.py was held, a signal handler for SIGCHLD was invoked in that main process in response to some other worker terminating. During that signal handling the main process also logged an error message, which triggered another sentry event and a corresponding prometheus counter update on the already-held lock, resulting in a deadlock.

To be more careful and conservative about what we do during signal handling, or in this case about what we instrument gunicorn to do, this change sets up all of the request-related metrics so that they are not emitted from the process they were created in (which, for these metrics, is the gunicorn arbiter if we're running gunicorn).

More details on the prometheus client behavior are in prometheus/client_python#1076.
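The failure mode is re-entrancy within a single thread: CPython runs signal handlers on the main thread, so a handler that tries to acquire a non-reentrant lock already held by the interrupted code can never succeed. A minimal sketch of that sequence (illustrative only, not code from this PR; a plain threading.Lock stands in for the prometheus_client multiprocess lock in values.py):

```python
import os
import signal
import threading
import time

# Stand-in for the global multiprocess-mode lock in prometheus_client's
# values.py; threading.Lock is not reentrant.
lock = threading.Lock()

def inc_counter():
    # Simplified stand-in for a prometheus counter update: take the
    # shared lock, mutate, release.
    with lock:
        time.sleep(1)  # widen the window so SIGCHLD lands while locked

def on_sigchld(signum, frame):
    # Mirrors the arbiter logging an error about a dead worker, which,
    # via the sentry/metrics instrumentation, increments a counter from
    # inside the signal handler.
    inc_counter()  # blocks forever: this thread already holds `lock`

signal.signal(signal.SIGCHLD, on_sigchld)

if os.fork() == 0:
    time.sleep(0.2)  # child: give the parent time to take the lock
    os._exit(0)      # then exit, delivering SIGCHLD to the parent

inc_counter()  # parent: the handler fires mid-update -> deadlock
```

And a hypothetical sketch of the behavior the change describes, i.e. suppressing metric updates in the process that created the metric; the wrapper class and its name are assumptions for illustration, not the PR's actual implementation:

```python
import os

class SkipInCreatingProcess:
    """Hypothetical wrapper: drop updates made in the process that
    created the metric (the gunicorn arbiter here), so nothing running
    in the arbiter's signal handlers touches the prometheus lock."""

    def __init__(self, metric):
        self._metric = metric
        self._creating_pid = os.getpid()

    def inc(self, amount=1):
        if os.getpid() == self._creating_pid:
            return  # created in this process (the arbiter): skip
        self._metric.inc(amount)
```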
Author
Closing in favor of #611