Hi, @grosser!
We are observing that from time to time the build never finishes and hangs for hours until it times out in CI (after 3 hours).
The build itself seems fine: most of the time it works, but occasionally it gets stuck.
We have observed this at Semaphore and also on Jenkins but I was never able to reproduce it on my local machine.
The CI output does not provide much help; it always looks similar to:
    ...
    -> Process 1 finishes and outputs here
    FFF -> some failures
    Process 2 simply hangs here and never finishes or dumps the failures
We are sure it's not about a single spec getting stuck because we have this in place:
    RSpec.configure do |config|
      config.around(:each) do |example|
        Timeout.timeout(5.minutes) do
          example.run
        end
      end
    end
While this is not perfect, it ensures that a single it block never takes more than 5 minutes. If one does, the spec fails and we can see it, which is very rare.
I added this to our spec_helper.rb:
    Signal.trap("INFO") do
      Thread.list.each do |thread|
        puts "Thread TID-#{(thread.object_id ^ ::Process.pid).to_s(36)} #{thread.name}"
        if thread.backtrace
          puts thread.backtrace.join("\n")
        else
          puts "<no backtrace available>"
        end
      end
    end
Shamelessly copied from Sidekiq here.
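A slightly more portable variant of the handler above, as a minimal sketch. As far as I know, INFO is only defined on macOS/BSD and is usually missing from Signal.list on Linux, so this picks the first supported signal from a list. The helper names (thread_dump, dump_signal, install_thread_dump_handler) and the USR1 fallback are my own assumptions, not something parallel_tests or Sidekiq provides.

```ruby
# Builds a text dump with one section per live thread.
def thread_dump
  Thread.list.map do |thread|
    header = "Thread TID-#{(thread.object_id ^ Process.pid).to_s(36)} #{thread.name}"
    body = thread.backtrace ? thread.backtrace.join("\n") : "<no backtrace available>"
    "#{header}\n#{body}"
  end.join("\n\n")
end

# Returns the first candidate signal name this platform supports, or nil.
# Assumption: USR1 is not already used for something else in the test suite.
def dump_signal(candidates = %w[INFO USR1])
  candidates.find { |name| Signal.list.key?(name) }
end

# Installs the dump handler and returns the signal name that was trapped.
def install_thread_dump_handler(signal = dump_signal)
  Signal.trap(signal) { $stderr.puts thread_dump }
  signal
end
```

Calling install_thread_dump_handler from spec_helper.rb would replace the Signal.trap("INFO") block; the returned name is the signal to use with kill.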
The problem is that a signal sent to the parallel_rspec PID is not propagated to the child processes, so the handler never runs in the workers. Even so, I still don't think a single spec is getting stuck.
Do you think it makes sense to propagate at least some signals, by default or configurable, to the child processes?
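To make the request concrete, here is a minimal sketch of what forwarding could look like in a parent process. forward_signals is a hypothetical helper, not an existing parallel or parallel_tests API, and it assumes the parent already has the list of child PIDs at hand.

```ruby
# Traps each named signal in the current process and re-sends it
# to every child PID. Children that have already exited are skipped.
def forward_signals(child_pids, signals: %w[USR1])
  signals.each do |name|
    Signal.trap(name) do
      child_pids.each do |pid|
        begin
          Process.kill(name, pid)
        rescue Errno::ESRCH
          # The child is gone; nothing to forward to.
        end
      end
    end
  end
end
```

With something like this in place, a single kill against the parent PID would reach the handler installed in each worker's spec_helper.rb.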
I don't think this is the issue, otherwise we would simply see the timeout error. However, Timeout is known to corrupt memory state here and there, and this might be happening inside parallel or parallel_tests, even though I see some Thread.handle_interrupt calls in parallel.
So a second request would be to have something like this INFO signal handler in parallel or parallel_tests to dump the thread state. Then we could do something trivial on CI (pseudo):
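    bundle exec parallel_rspec &
    pid=$!  # PID of the background job just started
    # SECONDS is a bash builtin: seconds elapsed since the shell started
    while kill -0 "$pid" 2>/dev/null; do
      if [ "$SECONDS" -ge 3600 ]; then
        kill -9 "$pid"     # beyond 60 minutes: kill abruptly
      elif [ "$SECONDS" -ge 3300 ]; then
        kill "$pid"        # beyond 55 minutes: try to interrupt
      elif [ "$SECONDS" -ge 3000 ]; then
        kill -INFO "$pid"  # beyond 50 minutes: try to dump where it is stuck
      fi
      sleep 30
    done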
That way we can try to debug parallel_tests if it got stuck or if it was actually our source code that got stuck.
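The same escalation could also live in a small Ruby watchdog, sketched below under a few assumptions: ESCALATION, signal_for and watch are hypothetical names, the thresholds simply mirror the 50/55/60 minute idea above, and the INFO entry would need to be swapped for a signal that exists on the CI platform.

```ruby
# Thresholds in seconds, checked from the most to the least severe.
ESCALATION = [
  [60 * 60, "KILL"], # kill abruptly
  [55 * 60, "TERM"], # try to interrupt
  [50 * 60, "INFO"], # try to dump where it is stuck
].freeze

# Returns the signal to send after elapsed_seconds, or nil if none is due.
def signal_for(elapsed_seconds, schedule = ESCALATION)
  _, signal = schedule.find { |threshold, _| elapsed_seconds >= threshold }
  signal
end

# Polls a child process until it exits, sending whichever signal is due.
# Returns the child's Process::Status.
def watch(pid, interval: 30, schedule: ESCALATION)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  # With WNOHANG, Process.wait returns nil while the child is still running.
  while Process.wait(pid, Process::WNOHANG).nil?
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    signal = signal_for(elapsed, schedule)
    Process.kill(signal, pid) if signal
    sleep interval
  end
  Process.last_status
end
```

Usage would be along the lines of watch(Process.spawn("bundle", "exec", "parallel_rspec")), with the build failing if the returned status is not a success.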
WDYT?
We are using:
parallel: 1.24.0
parallel_tests: 4.4.0
Found other issues mentioning something similar: #372 #74