parallel_rspec hanging #1028

@sobrinho

Hi, @grosser!

We are observing that, from time to time, the build never finishes and hangs for hours until it times out in CI (after 3 hours).

The build itself seems fine: most of the time it works, but now and then it gets stuck.

We have observed this on Semaphore and also on Jenkins, but I have never been able to reproduce it on my local machine.

The CI output does not help much. It always looks similar to this:

...
-> Process 1 finishes and outputs here
FFF -> some failures
Process 2 simply hangs here and never finishes or dumps the failures

We are sure it's not about a single spec getting stuck because we have this in place:

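require "timeout"

RSpec.configure do |config|
  # Fail any single example that runs for more than 5 minutes.
  config.around(:each) do |example|
    # 5.minutes comes from ActiveSupport; use 300 without it.
    Timeout.timeout(5.minutes) do
      example.run
    end
  end
end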

While this is not perfect, it ensures that a single example (it block) never runs for more than 5 minutes. If one does, the spec fails and we can see it, which is very rare.

I added this to our spec_helper.rb:

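# Dump the backtrace of every thread when the process receives SIGINFO.
# SIGINFO exists on BSD/macOS only; on Linux another signal would be
# needed (Sidekiq itself uses TTIN).
Signal.trap("INFO") do
  Thread.list.each do |thread|
    puts "Thread TID-#{(thread.object_id ^ ::Process.pid).to_s(36)} #{thread.name}"
    if thread.backtrace
      puts thread.backtrace.join("\n")
    else
      puts "<no backtrace available>"
    end
  end
end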

Shamelessly copied from Sidekiq.

The problem is that sending a signal to the parallel_rspec pid does not propagate it to the child processes. Even so, I still don't think a single spec is getting stuck.

Do you think it makes sense to propagate at least some signals, by default or configurable, to the child processes?
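
To make the request concrete, here is a minimal sketch of what signal propagation could look like. It is not how parallel or parallel_tests is implemented: `run_with_forwarding` is a hypothetical helper, the workers are plain `fork`ed processes that report over a pipe, and USR1 stands in for INFO so the sketch also runs on Linux.

```ruby
# Hypothetical sketch: a parent process forwards a signal to its forked
# children. Assumes a Unix platform with fork support.
def run_with_forwarding(worker_count, signal: "USR1")
  reader, writer = IO.pipe

  pids = Array.new(worker_count) do |index|
    fork do
      reader.close
      got_signal = false
      # Each worker installs its own handler before reporting readiness.
      Signal.trap(signal) { got_signal = true }
      writer.puts "ready #{index}"
      sleep 0.05 until got_signal
      writer.puts "worker #{index} got #{signal}"
      writer.close
      exit! 0 # skip at_exit hooks inherited from the parent
    end
  end
  writer.close

  # Parent handler: relay the signal to every child that is still alive.
  previous = Signal.trap(signal) do
    pids.each do |pid|
      begin
        Process.kill(signal, pid)
      rescue Errno::ESRCH
        # The child has already exited; nothing to forward.
      end
    end
  end

  # Wait until every worker has installed its trap.
  worker_count.times { reader.gets }

  # Stand-in for an external `kill -USR1 <parent pid>`.
  Process.kill(signal, Process.pid)

  lines = reader.read.lines.map(&:chomp)
  pids.each { |pid| Process.wait(pid) }
  lines.sort
ensure
  Signal.trap(signal, previous) if previous
end
```

Calling `run_with_forwarding(2)` returns one line per worker confirming that the signal sent to the parent reached it. A real implementation would also have to decide which signals to relay and what to do with children that ignore them.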

I don't think Timeout is the cause, otherwise we would simply see the timeout error. Still, Timeout is known to corrupt memory state here and there, and this might be happening inside parallel or parallel_tests, even though parallel does use Thread.handle_interrupt in some places.

So a second request would be for parallel or parallel_tests to handle something like this INFO signal and dump the thread state. Then we could do something trivial on CI (pseudo):
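
For context, this is roughly what Thread.handle_interrupt protects against. The sketch below is not taken from parallel: `run_critical_section` is a hypothetical helper, and a plain `Thread#raise` with a RuntimeError stands in for the asynchronous exception that Timeout delivers. Inside the block the exception is held back, so the critical section finishes before the thread is interrupted.

```ruby
# Sketch: defer an asynchronous exception until a critical section is done.
def run_critical_section
  events = []
  entered = Queue.new

  worker = Thread.new do
    begin
      Thread.handle_interrupt(RuntimeError => :never) do
        entered << true          # tell the main thread we are inside the block
        sleep 0.2                # stands in for work that must not be cut short
        events << :critical_done
      end
      events << :not_interrupted # skipped: the pending exception fires first
    rescue RuntimeError
      events << :interrupted
    end
  end

  entered.pop                    # wait until the worker is inside the block
  worker.raise(RuntimeError, "simulated timeout")
  worker.join
  events
end
```

Any code that runs outside such a block can still be interrupted at an arbitrary point, which is how state can end up half-updated.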

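#!/usr/bin/env bash
bundle exec parallel_rspec &
pid=$! # PID of the background job ($1 would be the first script argument)

# SECONDS is a bash builtin that counts seconds since the shell started.
while kill -0 "$pid" 2>/dev/null; do
  if [ "$SECONDS" -ge 3600 ]; then
    kill -9 "$pid"    # beyond 60 minutes: kill abruptly
  elif [ "$SECONDS" -ge 3300 ]; then
    kill "$pid"       # beyond 55 minutes: try to interrupt
  elif [ "$SECONDS" -ge 3000 ]; then
    kill -INFO "$pid" # beyond 50 minutes: try to dump where it is stuck
  fi

  sleep 30
done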

That way we could tell whether parallel_tests itself got stuck or whether it was actually our own code.

WDYT?


We are using:

parallel: 1.24.0
parallel_tests: 4.4.0

Found other issues mentioning something similar: #372 #74
