Improve TLS codegen by marking the panic/init path as cold#143511
Conversation
|
rustbot has assigned @workingjubilee. Use |
|
Not sure if this will show up at all on perf but 🤷 @bors2 try @rust-timer queue Do you have any local benchmarks? |
This comment has been minimized.
This comment has been minimized.
Improve TLS codegen by marking the panic/init path as cold This is an extension of the performance improvements seen from <#141685>. I noticed that the non-`const` TLS still didn't have the `#[cold]` attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever. These paths are taken either only once per thread (`init`) or never (`panic`, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to `#[cold]` I added the more aggressive `#[inline(never)]` to both cold paths as well.
|
@compiler-errors No I don't have any local benchmarks. But I look at assembly output a lot, and trust me when I say these code paths should never get inlined. Could you restart the benchmark with my second commit included? |
|
@bors2 try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Improve TLS codegen by marking the panic/init path as cold This is an extension of the performance improvements seen from <#141685>. I noticed that the non-`const` TLS still didn't have the `#[cold]` attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever. These paths are taken either only once per thread (`init`) or never (`panic`, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to `#[cold]` I added the more aggressive `#[inline(never)]` to both cold paths as well.
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (8b17150): comparison URL. Overall result: ❌✅ regressions and improvements - no action neededBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary 5.4%, secondary 2.4%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (primary 2.6%, secondary -2.8%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary 0.0%, secondary 0.1%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 459.09s -> 461.518s (0.53%) |
|
I removed some Codegen is still nicer just due to the addition of |
|
@bors2 try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Improve TLS codegen by marking the panic/init path as cold This is an extension of the performance improvements seen from <#141685>. I noticed that the non-`const` TLS still didn't have the `#[cold]` attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever. These paths are taken either only once per thread (`init`) or never (`panic`, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to `#[cold]` I added the more aggressive `#[inline(never)]` to both cold paths as well.
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (9782d0a): comparison URL. Overall result: ✅ improvements - no action neededBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesThis benchmark run did not return any relevant results for this metric. Binary sizeResults (primary 0.0%, secondary 0.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 461.809s -> 462.209s (0.09%) |
| /// The resulting pointer may not be used after reentrant inialialization | ||
| /// or thread destruction has occurred. | ||
| #[inline] | ||
| pub fn get(&'static self, i: Option<&mut Option<T>>, f: impl FnOnce() -> T) -> *const T { |
There was a problem hiding this comment.
While you're at it, I think it might be beneficial to inline the ptr.addr() == 1 case into this function, as that might yield more optimized LocalKey::withs.
| if let State::Alive = self.state.get() { | ||
| self.val.get() | ||
| } else { | ||
| unsafe { self.get_or_init_slow() } |
There was a problem hiding this comment.
I don't think this is beneficial – the returned pointer is later compared against null in LocalKey::with anyway, so the optimiser should be able to merge the state comparison into that.
There was a problem hiding this comment.
I think it is beneficial. I think anything that is not the initialized path should be marked cold and gotten out of the way, even if that makes the non-initialized path slightly slower and have duplicated work.
The initialized hot path is what matters 99.999% of the time and should be prioritized over all else.
There was a problem hiding this comment.
I've made this toy example to illustrate this: https://rust.godbolt.org/z/hd6hnGWGh.
Note that because UnsafeCell::get cannot return a null pointer, the fast-path once inlined completely eliminates the nullptr check and only checks the state.
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
|
Sorry for the delay. I've rebased on the latest main and addressed the review comments (either solving their concern or contesting). I've also made one small change, I've explicitly assigned discriminant 0 to the @rustbot label -S-waiting-on-author +S-waiting-on-review |
|
I tested the performance of my idea, and it truly appears to be worse. Let's do a final perf-run of this, and then this should be good to go. @bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Improve TLS codegen by marking the panic/init path as cold
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (4fa0244): comparison URL. Overall result: no relevant changes - no action neededBenchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up. @rustbot label: -S-waiting-on-perf -perf-regression Instruction countThis perf run didn't have relevant results for this metric. Max RSS (memory usage)Results (secondary -4.9%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesThis perf run didn't have relevant results for this metric. Binary sizeThis perf run didn't have relevant results for this metric. Bootstrap: 516.093s -> 514.238s (-0.36%) |
|
Well, that's underwhelming... But I think I'll merge this anyway, it makes the implementations more consistent and is nice to have in general... @bors r+ rollup |
…uwer Rollup of 13 pull requests Successful merges: - #147302 (asm! support for the Xtensa architecture) - #148820 (Add very basic "comptime" fn implementation) - #157299 (Fix unstable diagnostics in tests) - #143511 (Improve TLS codegen by marking the panic/init path as cold) - #154608 (Add `_value` API for number literals in proc-macro) - #156762 (xfs support in `test_rename_directory_to_non_empty_directory`) - #157300 (Relax test requirements for consistency) - #157383 (tests: codegen-llvm: Ignore BPF targets in c-variadic-opt) - #157413 (fix: don't suggest .into_iter() for .cloned()/.copied() on non-reference Option) - #157578 (Fix diagnostics for non-exhaustive destructuring assignments (#157553)) - #157587 (explain that the size_of constant also serves to avoid optimizing away 'unused' size_of calls) - #157596 (test: remove ineffective link-extern-crate-with-drop-type test) - #157602 (rustdoc: Remove unnecessary fast path)
Rollup merge of #143511 - orlp:tls-cold-init, r=joboet Improve TLS codegen by marking the panic/init path as cold This is an extension of the performance improvements seen from <#141685>. I noticed that the non-`const` TLS still didn't have the `#[cold]` attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever. These paths are taken either only once per thread (`init`) or never (`panic`, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to `#[cold]` I added the more aggressive `#[inline(never)]` to both cold paths as well.
|
@rust-timer build 6f652bc |
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (6f652bc): comparison URL. Overall result: no relevant changes - no action neededBenchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up. @rustbot label: -S-waiting-on-perf -perf-regression Instruction countThis perf run didn't have relevant results for this metric. Max RSS (memory usage)Results (primary -1.3%, secondary 6.8%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (secondary -0.5%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary 0.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 517.572s -> 515.951s (-0.31%) |
View all comments
This is an extension of the performance improvements seen from #141685. I noticed that the non-
constTLS still didn't have the#[cold]attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever.These paths are taken either only once per thread (
init) or never (panic, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to#[cold]I added the more aggressive#[inline(never)]to both cold paths as well.