Improve TLS codegen by marking the panic/init path as cold #143511

Merged
rust-bors[bot] merged 5 commits into rust-lang:main from orlp:tls-cold-init
Jun 8, 2026

@orlp orlp commented Jul 5, 2025

Contributor

This is an extension of the performance improvements seen from #141685. I noticed that the non-const TLS still didn't have the #[cold] attribute for the uninit/panic path, and I also realized that neither implementation should have the initialization or panic path inlined, ever.

These paths are taken either only once per thread (init) or never (panic, in a well-behaving Rust program), thus they don't deserve to litter the code generated each time you access a thread-local variable. So in addition to #[cold] I added the more aggressive #[inline(never)] to both cold paths as well.
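
As a rough illustration of the split described above, here is a minimal sketch of a lazily initialized slot whose initialization lives in a separate `#[cold]`, `#[inline(never)]` function. `LazySlot`, `init_slow`, and the value `42` are hypothetical names and data chosen for this example; this is not the code in `library/std`.

```rust
use std::cell::Cell;

/// Hypothetical lazily initialized slot (an illustration, not the std TLS code).
pub struct LazySlot {
    value: Cell<Option<u64>>,
    init_calls: Cell<u32>,
}

impl LazySlot {
    pub const fn new() -> Self {
        LazySlot { value: Cell::new(None), init_calls: Cell::new(0) }
    }

    /// Hot path: a single check, then the cached value is returned.
    #[inline]
    pub fn get(&self) -> u64 {
        match self.value.get() {
            Some(v) => v,
            // Rare path: kept in its own function so its body is not
            // inlined into every call site of `get`.
            None => self.init_slow(),
        }
    }

    /// Cold path: runs only while the slot is still empty.
    #[cold]
    #[inline(never)]
    fn init_slow(&self) -> u64 {
        self.init_calls.set(self.init_calls.get() + 1);
        let v = 42; // stand-in for an expensive initializer
        self.value.set(Some(v));
        v
    }

    /// Number of times the cold path has run (for demonstration only).
    pub fn init_calls(&self) -> u32 {
        self.init_calls.get()
    }
}
```

Calling `get()` twice on a fresh `LazySlot` runs `init_slow` once; the second call takes only the `Some` branch. Whether the attributes change the generated code for a given target is something to confirm by inspecting the assembly.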

@rustbot

rustbot commented Jul 5, 2025

Collaborator

r? @workingjubilee

rustbot has assigned @workingjubilee.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jul 5, 2025
@compiler-errors

Contributor

Not sure if this will show up at all on perf but 🤷

@bors2 try @rust-timer queue

Do you have any local benchmarks?

rust-bors Bot added a commit that referenced this pull request Jul 5, 2025
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 5, 2025
@rust-bors

rust-bors Bot commented Jul 5, 2025

Contributor

⌛ Trying commit db7b096 with merge 9f2c18a

To cancel the try build, run the command @bors2 try cancel.

@orlp

orlp commented Jul 5, 2025

Contributor Author

@compiler-errors No, I don't have any local benchmarks. But I look at assembly output a lot, and trust me when I say these code paths should never get inlined.

Could you restart the benchmark with my second commit included?

@compiler-errors

Contributor

@bors2 try @rust-timer queue

@rust-bors

rust-bors Bot commented Jul 5, 2025

Contributor

⌛ Trying commit cf4669e with merge 8b17150

(The previously running try build was automatically cancelled.)

To cancel the try build, run the command @bors2 try cancel.

rust-bors Bot added a commit that referenced this pull request Jul 5, 2025
@rust-bors

rust-bors Bot commented Jul 6, 2025

Contributor

☀️ Try build successful (CI)
Build commit: 8b17150 (8b17150009e237f23856ea93eb9b208049d8a621, parent: 175e04331be56c5b4bdf77478434b1a5e0556770)

@rust-timer

Collaborator

Finished benchmarking commit (8b17150): comparison URL.

Overall result: ❌✅ regressions and improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 1 |
| Improvements ✅ (primary) | -0.3% | [-0.3%, -0.3%] | 1 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 1 |
| All ❌✅ (primary) | -0.3% | [-0.3%, -0.3%] | 1 |

Max RSS (memory usage)

Results (primary 5.4%, secondary 2.4%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 5.4% | [4.3%, 7.1%] | 3 |
| Regressions ❌ (secondary) | 2.4% | [2.4%, 2.4%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 5.4% | [4.3%, 7.1%] | 3 |

Cycles

Results (primary 2.6%, secondary -2.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 2.6% | [2.6%, 2.6%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.8% | [-2.8%, -2.8%] | 1 |
| All ❌✅ (primary) | 2.6% | [2.6%, 2.6%] | 1 |

Binary size

Results (primary 0.0%, secondary 0.1%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 0.1% | [0.0%, 0.5%] | 15 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 0.1%] | 37 |
| Improvements ✅ (primary) | -0.2% | [-0.7%, -0.0%] | 5 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.7%, 0.5%] | 20 |

Bootstrap: 459.09s -> 461.518s (0.53%)
Artifact size: 372.18 MiB -> 372.13 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
@orlp

orlp commented Jul 6, 2025

Contributor Author

I removed some #[inline(never)]s because they pessimized codegen. I had forgotten that the TLS pointer returned by the get() call is still wrapped inside LocalKey and checked again to see whether a panic is required. This PR now only adds explicit hot paths, with #[cold] on the fallback.

Codegen is still nicer just from the addition of #[cold]: it at least moves the initialization out of the hot path (and the compiler may still decide not to inline it).

@lqd

lqd commented Jul 6, 2025

Member

@bors2 try @rust-timer queue

@rust-bors

rust-bors Bot commented Jul 6, 2025

Contributor

⌛ Trying commit 92fa8e8 with merge 9782d0a

To cancel the try build, run the command @bors2 try cancel.

rust-bors Bot added a commit that referenced this pull request Jul 6, 2025
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
@rust-bors

rust-bors Bot commented Jul 6, 2025

Contributor

☀️ Try build successful (CI)
Build commit: 9782d0a (9782d0a1d99759de86b20e0863061637a0a3c245, parent: c83e217d268d25960a0c79c6941bcb3917a6a0af)

@rust-timer

Collaborator

Finished benchmarking commit (9782d0a): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

Our most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.3% | [-0.3%, -0.3%] | 2 |
| All ❌✅ (primary) | - | - | 0 |

Max RSS (memory usage)

This benchmark run did not return any relevant results for this metric.

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.0%, secondary 0.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 0.0% | [0.0%, 0.0%] | 1 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 9 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -0.0% | [-0.0%, -0.0%] | 1 |
| All ❌✅ (primary) | 0.0% | [0.0%, 0.0%] | 1 |

Bootstrap: 461.809s -> 462.209s (0.09%)
Artifact size: 372.19 MiB -> 372.13 MiB (-0.02%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jul 6, 2025
Comment thread library/std/src/sys/thread_local/native/eager.rs
Comment thread library/std/src/sys/thread_local/native/eager.rs Outdated
/// The resulting pointer may not be used after reentrant initialization
/// or thread destruction has occurred.
#[inline]
pub fn get(&'static self, i: Option<&mut Option<T>>, f: impl FnOnce() -> T) -> *const T {

Member

While you're at it, I think it might be beneficial to inline the ptr.addr() == 1 case into this function, as that might yield more optimized LocalKey::withs.

Contributor Author

I disagree, see my other comment.

Comment on lines +33 to +36
        if let State::Alive = self.state.get() {
            self.val.get()
        } else {
            unsafe { self.get_or_init_slow() }

Member

I don't think this is beneficial – the returned pointer is later compared against null in LocalKey::with anyway, so the optimiser should be able to merge the state comparison into that.

Contributor Author

I think it is beneficial. I think anything that is not the initialized path should be marked cold and gotten out of the way, even if that makes the non-initialized path slightly slower and have duplicated work.

The initialized hot path is what matters 99.999% of the time and should be prioritized over all else.

@orlp orlp Jun 5, 2026

Contributor Author

I've made this toy example to illustrate this: https://rust.godbolt.org/z/hd6hnGWGh.

Note that because UnsafeCell::get cannot return a null pointer, the fast-path once inlined completely eliminates the nullptr check and only checks the state.
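
For readers who cannot open the Godbolt link, the shape of the argument can be sketched as follows. This is a hypothetical stand-in, not the linked example and not the std implementation: `Slot`, `try_with`, `destroy`, and the value `7` are assumptions made for illustration. The fast path returns the pointer from `UnsafeCell::get`, which is never null, so only the slow path can hand a null pointer to the caller's check.

```rust
use std::cell::{Cell, UnsafeCell};
use std::ptr;

#[derive(Clone, Copy)]
enum State {
    Alive,
    Uninit,
    Destroyed,
}

/// Hypothetical per-thread slot (illustrative only).
pub struct Slot {
    state: Cell<State>,
    val: UnsafeCell<u32>,
}

impl Slot {
    pub const fn new() -> Self {
        Slot { state: Cell::new(State::Uninit), val: UnsafeCell::new(0) }
    }

    /// Fast path: check the state and return a pointer that cannot be null.
    #[inline]
    pub fn get(&self) -> *const u32 {
        if let State::Alive = self.state.get() {
            self.val.get().cast_const()
        } else {
            self.get_or_init_slow()
        }
    }

    /// Slow path: initialize on first use, or report destruction with null.
    #[cold]
    fn get_or_init_slow(&self) -> *const u32 {
        match self.state.get() {
            State::Alive => self.val.get().cast_const(),
            State::Uninit => {
                // SAFETY: no reference into `val` has been handed out yet.
                unsafe { *self.val.get() = 7 };
                self.state.set(State::Alive);
                self.val.get().cast_const()
            }
            State::Destroyed => ptr::null(),
        }
    }

    /// Marks the slot as destroyed (stand-in for thread teardown).
    pub fn destroy(&self) {
        self.state.set(State::Destroyed);
    }

    /// Caller-side null check, returning `None` instead of panicking.
    pub fn try_with<R>(&self, f: impl FnOnce(&u32) -> R) -> Option<R> {
        let ptr = self.get();
        // SAFETY: a non-null pointer points at `self.val`, which lives as
        // long as `&self`, and nothing mutates it while `f` runs.
        unsafe { ptr.as_ref() }.map(f)
    }
}
```

Once `get` is inlined into `try_with`, the optimizer is free to drop the null test on the `Alive` branch, leaving only the state comparison there; whether it actually does so should be checked in the emitted assembly rather than assumed.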

@orlp orlp force-pushed the tls-cold-init branch from 92fa8e8 to a7790c6 Compare June 5, 2026 10:55
@rustbot

rustbot commented Jun 5, 2026

Collaborator

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@orlp

orlp commented Jun 5, 2026

Contributor Author

Sorry for the delay. I've rebased on the latest main and addressed the review comments (either solving their concern or contesting).

I've also made one small change: I've explicitly assigned discriminant 0 to the Alive state, which can make the check faster on Arm (a direct cbz instead of a cmp first).
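
A minimal sketch of what pinning the discriminant looks like, assuming a `#[repr(u8)]` enum with hypothetical variant names and an `is_alive` helper; the actual definition in `library/std` may differ.

```rust
/// Hypothetical state enum; `Alive` is pinned to 0 so that the hot-path
/// check becomes a comparison against zero.
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum State {
    Alive = 0,
    Uninitialized = 1,
    Destroyed = 2,
}

/// Returns true when the state is `Alive`, i.e. when the discriminant is 0.
#[inline]
pub fn is_alive(state: State) -> bool {
    state as u8 == 0
}
```

Comparing against zero lets a backend use a compare-and-branch-on-zero instruction where one exists, though the exact instruction selection is up to the compiler.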

@rustbot label -S-waiting-on-author +S-waiting-on-review

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 5, 2026
@joboet

joboet commented Jun 7, 2026

Member

I tested the performance of my idea, and it truly appears to be worse.

Let's do a final perf-run of this, and then this should be good to go.

@bors try @rust-timer queue

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 7, 2026
rust-bors Bot pushed a commit that referenced this pull request Jun 7, 2026
Improve TLS codegen by marking the panic/init path as cold
@rust-bors

rust-bors Bot commented Jun 7, 2026

Contributor

☀️ Try build successful (CI)
Build commit: 4fa0244 (4fa024445d244dbe0860ff6c1500d718a02ec239, parent: 43a4909ee98ed4d006d9d773f5d94dc58e34f846)

@rust-timer

Collaborator

Finished benchmarking commit (4fa0244): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up.

@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This perf run didn't have relevant results for this metric.

Max RSS (memory usage)

Results (secondary -4.9%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -4.9% | [-4.9%, -4.9%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Cycles

This perf run didn't have relevant results for this metric.

Binary size

This perf run didn't have relevant results for this metric.

Bootstrap: 516.093s -> 514.238s (-0.36%)
Artifact size: 400.83 MiB -> 401.34 MiB (0.13%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 7, 2026
@joboet

joboet commented Jun 8, 2026

Member

Well, that's underwhelming...

But I think I'll merge this anyway, it makes the implementations more consistent and is nice to have in general...

@bors r+ rollup

@rust-bors

rust-bors Bot commented Jun 8, 2026

Contributor

📌 Commit 2de39c0 has been approved by joboet

It is now in the queue for this repository.

@rust-bors rust-bors Bot added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 8, 2026
rust-bors Bot pushed a commit that referenced this pull request Jun 8, 2026
…uwer

Rollup of 13 pull requests

Successful merges:

 - #147302 (asm! support for the Xtensa architecture)
 - #148820 (Add very basic "comptime" fn implementation)
 - #157299 (Fix unstable diagnostics in tests)
 - #143511 (Improve TLS codegen by marking the panic/init path as cold)
 - #154608 (Add `_value` API for number literals in proc-macro)
 - #156762 (xfs support in `test_rename_directory_to_non_empty_directory`)
 - #157300 (Relax test requirements for consistency)
 - #157383 (tests: codegen-llvm: Ignore BPF targets in c-variadic-opt)
 - #157413 (fix: don't suggest .into_iter() for .cloned()/.copied() on non-reference Option)
 - #157578 (Fix diagnostics for non-exhaustive destructuring assignments (#157553))
 - #157587 (explain that the size_of constant also serves to avoid optimizing away 'unused' size_of calls)
 - #157596 (test: remove ineffective link-extern-crate-with-drop-type test)
 - #157602 (rustdoc: Remove unnecessary fast path)
@rust-bors rust-bors Bot merged commit 7fc9526 into rust-lang:main Jun 8, 2026
13 checks passed
@rustbot rustbot added this to the 1.98.0 milestone Jun 8, 2026
rust-timer added a commit that referenced this pull request Jun 8, 2026
Rollup merge of #143511 - orlp:tls-cold-init, r=joboet
@JonathanBrouwer

Contributor

@rust-timer build 6f652bc

@rust-timer

Collaborator

Finished benchmarking commit (6f652bc): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking means the PR may be perf-sensitive. Consider adding rollup=never if this change is not fit for rolling up.

@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This perf run didn't have relevant results for this metric.

Max RSS (memory usage)

Results (primary -1.3%, secondary 6.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 2.8% | [2.8%, 2.8%] | 1 |
| Regressions ❌ (secondary) | 6.8% | [6.8%, 6.8%] | 1 |
| Improvements ✅ (primary) | -5.4% | [-5.4%, -5.4%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.3% | [-5.4%, 2.8%] | 2 |

Cycles

Results (secondary -0.5%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 3.6% | [3.6%, 3.6%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -4.6% | [-4.6%, -4.6%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

Results (primary 0.0%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|:--|:--|:--|:--|
| Regressions ❌ (primary) | 0.0% | [0.0%, 0.0%] | 3 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [0.0%, 0.0%] | 3 |

Bootstrap: 517.572s -> 515.951s (-0.31%)
Artifact size: 400.85 MiB -> 400.77 MiB (-0.02%)

Labels

S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs Relevant to the library team, which will review and decide on the PR/issue.
