Freeze On Some Systems During Demo Read #3

@ghost

Description

It looks like reading a demo while Portal 2 is also accessing it causes a freeze on some systems. The user who has been experiencing this gets a complete system lock-up.
The freeze occurs in two cases: on splitting (reading a completed demo) and on detecting a new demo. I tried adding a sleep, but that only fixes the first case. The second case can still occur, since Portal 2 is always touching the live demo.
At this point, the only fix I can think of is to change how demos are read. The code should detect that a new demo has been started without trying to read it, wait until it has been fully written (i.e., until an even newer demo has been started), and only then read it. This leaves a corner case with the last split, but perhaps I can think of something to resolve that case.
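
As a rough illustration of that idea, here is a minimal Python sketch (not the splitter's actual code). It assumes demos can be told apart by last write time taken from directory metadata, that they use a `.dem` extension, and that listing the folder does not itself trigger the freeze; none of this has been verified on an affected system. The names `completed_demos` and `scan_demo_folder` are hypothetical.

```python
import os

def completed_demos(mtimes, already_read=frozenset()):
    """Return demos that should be safe to read, oldest first.

    mtimes maps demo path -> last write time. The newest demo is assumed
    to be the one Portal 2 is still writing, so it is always held back.
    """
    ordered = sorted(mtimes, key=lambda path: (mtimes[path], path))
    finished = ordered[:-1]  # everything except the newest demo
    return [path for path in finished if path not in already_read]

def scan_demo_folder(folder):
    """Collect last write times from directory metadata; no demo is opened."""
    return {
        entry.path: entry.stat().st_mtime
        for entry in os.scandir(folder)
        if entry.is_file() and entry.name.lower().endswith(".dem")  # assumed extension
    }
```

A caller would run `completed_demos(scan_demo_folder(folder), already_read)` on each pass. The newest demo is never returned, which is exactly the last-split corner case mentioned above.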

I wrote the following message for my explanation of what's going on. I think some parts of it are subtly wrong at this point, but it's pretty close to accurate.

The code looks at each demo in the folder and, one by one, checks its last write time. If the demo hasn't been updated since the last check, the code takes a 16 millisecond nap and then moves on to the next file. If the demo does look updated, it immediately tries to read it. If Portal 2 is in the middle of writing the demo (which is why its write time looks updated) while the splitter is trying to read it, they'll deadlock. I've just now added a 5 second sleep before the read to avoid this scenario.
Now for your question: Does the number of demos matter? I think yes: the fewer demos you have, the more likely you'll see the problem. Why? Well, we check each demo every 16 * NUMBER_OF_DEMOS milliseconds. As NUMBER_OF_DEMOS increases, you're checking each individual demo less often, and thus are less likely to deadlock with the most recent demo. So, perhaps, Bill was defending against this by having so many demos.
This crash is essentially random, and becomes less likely as you build up more demos. The fact that you just happened to crash today only means your luck ran out. It's kinda like infant mortality: you're at especially high risk of dying before the age of 1. Just because you didn't die when the odds were highest doesn't mean you'll never die.
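
To put numbers on the 16 * NUMBER_OF_DEMOS claim, here is a small Python sketch. It assumes the 16 ms nap is the only cost per file and ignores the time spent actually reading demos; the function names are made up for illustration.

```python
NAP_MS = 16  # per-file nap quoted in the explanation above

def revisit_interval_ms(number_of_demos, nap_ms=NAP_MS):
    """Approximate time between two checks of the same demo."""
    if number_of_demos < 1:
        raise ValueError("need at least one demo")
    return nap_ms * number_of_demos

def checks_per_second(number_of_demos, nap_ms=NAP_MS):
    """How often any single demo (including the live one) gets checked."""
    return 1000 / revisit_interval_ms(number_of_demos, nap_ms)

# Compare a nearly empty folder with a well-stocked one.
for n in (1, 10, 312):
    print(n, revisit_interval_ms(n), checks_per_second(n))
```

With a single demo, the live file is checked every 16 ms; with 312 demos, only about once every 5 seconds. That gives far fewer chances to collide with Portal 2's writes.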

Do keep in mind that there are two different crashes: the one described above, and the original one. The original one is a bit more subtle. It's essentially doing a "timing attack". We split, then wait some number of milliseconds based on the number of demos in the folder, and then try to read the newest file. If Portal 2 takes 4,992 milliseconds to start writing to the new demo, and there are 312 demos to be read (312 * 16 = 4,992), then the splitter will try reading the newest demo at the exact same moment that Portal 2 is trying to write it, and deadlock.
So the bug you got today is random, but the old bug is much more deterministic. You could probably reproduce the old bug with a fair amount of accuracy.
