Skip to content

Speed up compilation of large constant arrays#51833

Merged
bors merged 9 commits into
rust-lang:masterfrom
wesleywiser:faster_large_constant_arrays
Jul 1, 2018
Merged

Speed up compilation of large constant arrays#51833
bors merged 9 commits into
rust-lang:masterfrom
wesleywiser:faster_large_constant_arrays

Conversation

@wesleywiser

@wesleywiser wesleywiser commented Jun 27, 2018

Copy link
Copy Markdown
Member

This is a different approach to #51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.

With this change, the toy program goes from 63 seconds to 19 seconds on my machine.

Edit: Inlining Size::bytes() saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

Edit2: Now down to 2.8 seconds.

r? @oli-obk

cc @nnethercote @eddyb

This is a different approach to rust-lang#51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 27, 2018
This save an additional 6 seconds on the test program.
@nnethercote

Copy link
Copy Markdown
Contributor

Nice work!

Edit: Inlining Size::bytes() saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

For the first patch in #51672 I inlined several methods in addition to bytes(). This was based on profiling data -- while bytes() is clearly the hottest, the others were still hot, and I think it would be worth trying to inline all of them to see if you get a bigger speed-up.

@oli-obk oli-obk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some awesome improvements! After the review is addressed we'll throw it into the perf tests.

self.copy_undef_mask(src, dest, size)?;
self.copy_undef_mask(src, dest, size * length)?;
// copy back the relocations
self.get_mut(dest.alloc_id)?.relocations.insert_presorted(relocations);

@oli-obk oli-obk Jun 27, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to reapeat this, too (and offset the indices).

Try a [&FOO; 500] (for non-ZST FOO) and then access any field but the first (at compile-time! at runtime you'll get a segfault). If I'm reading the code correctly this will tell you about a dangling pointer.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks! Can you double check my math?

Comment thread src/librustc_mir/interpret/memory.rs Outdated
}

self.copy_undef_mask(src, dest, size)?;
self.copy_undef_mask(src, dest, size * length)?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this results in the correct result, it does n^2/2 copies instead of n copies. Inside the function itself we should probably move the self.get(src.alloc_id)? out of the loops, too. We can probably improve the nonoverlapping case enormously, too by not requiring an intermediate allocation.

@michaelwoerister

michaelwoerister commented Jun 27, 2018

Copy link
Copy Markdown
Member

It would be interesting to see if this could be further sped up by copying larger chunks of memory at a time. Right now this makes n calls to memcpy but memcpy can be more efficient when it works with more memory at once (via SIMD). We could try increase the chunk size and see what that does to performance. In particular copy [0 .. size] to [size .. size *2], then [0 .. size*2] to [size*2 .. size*4], then [0..size*4] to [size*4 .. size*8], etc, up to a maximum chunk size. This makes copy_repeatedly quite a bit more complicated but it might be worth it?

This save 3 seconds on the test program.
@scottmcm

Copy link
Copy Markdown
Member

increase the chunk size

Looks like the logic for that already exists in core:

// `2^expn` repetition is done by doubling `buf` `expn`-times.

Maybe there's a way to extract the copying part into an unsafe ptr method and re-use it here?

@michaelwoerister

Copy link
Copy Markdown
Member

@scottmcm Great find!

@wesleywiser

Copy link
Copy Markdown
Member Author

@nnethercote That's a good idea. I tried that and it saved an additional 3 seconds on the compile time (now 10 seconds).

This saves 4.5 seconds and takes the compile time down to 5.5 seconds.
This saves 0.5 seconds on the test compilation.
This saves 2.5 seconds on the test program.
@wesleywiser

wesleywiser commented Jun 30, 2018

Copy link
Copy Markdown
Member Author

@michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Edit: I can push that commit too if you think it's worth pursuing.

@wesleywiser

Copy link
Copy Markdown
Member Author

With the latest changes, the compile time is down to 2.8 seconds on my machine.

@nnethercote

Copy link
Copy Markdown
Contributor

With the latest changes, the compile time is down to 2.8 seconds on my machine.

A 22.5x speedup!

@oli-obk oli-obk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome improvements. Just a nit so miri the tool keeps working.

@@ -882,25 +882,16 @@ impl<'a, 'mir, 'tcx, M: Machine<'mir, 'tcx>> Memory<'a, 'mir, 'tcx, M> {
) -> EvalResult<'tcx> {
// The bits have to be saved locally before writing to dest in case src and dest overlap.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment makes me think that we should not do this commit, otherwise we'll run into trouble in the future (and in miri right now). Can you do an if for whether there is overlap and if there is, just run the old code?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I thought I preserved the existing behavior by cloning the source allocation's undef_mask before writing to the destination's. Is that sufficient?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right. sorry. I misread the code.

I still think the code isn't doing the right thing. It's only copying once, when it should be copying N-1 times.

You can try this out by creating an array of types with padding, everything starting at the third element will probably not have undef masks for the padding. (you'll need unions to get the bits and then attempt to use them for an array length to actually get a compiler error from that)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I'm not quite following. We do call this function with size * length so shouldn't it cover all of the repeated copies? Can you provide a sample program that will fail?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are using the length, but that just means that the entire array is copied from 0..N to 1..=N, not that the 1st element is copied N times.

I'll make a regression test

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly certain that the following test will succeed to compile on your PR: http://play.rust-lang.org/?gist=1d0183fcfb65164d1ca58ccd9614c33c

);
}
for i in 0..size.bytes() {
let defined = undef_mask.get(src.offset + Size::from_bytes(i));

@oli-obk oli-obk Jun 30, 2018

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you pass a repeat counter to the function, you should be able to just modulo the i here over the size and have the for loop go from 0 ot size.bytes() * repeat

@oli-obk

oli-obk commented Jun 30, 2018

Copy link
Copy Markdown
Contributor

Please also add a test for http://play.rust-lang.org/?gist=1c0e90ac9064edfa12fbd286902e20ef to make sure we always properly copy the relocations.

@wesleywiser

Copy link
Copy Markdown
Member Author

Added tests and fixed that issue

@rust-highfive

Copy link
Copy Markdown
Contributor

The job x86_64-gnu-llvm-3.9 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

[00:05:01] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:05:01] tidy error: /checkout/src/test/compile-fail/const-err4.rs: missing trailing newline
[00:05:02] some tidy checks failed
[00:05:02] 
[00:05:02] 
[00:05:02] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:05:02] 
[00:05:02] 
[00:05:02] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Makefile:79: recipe for target 'tidy' failed
[00:05:02] make: *** [tidy] Error 1

The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:13ca9c5e
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
---
travis_time:end:007a23e0:start=1530387249551729531,finish=1530387249560171319,duration=8441788
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:012b7739
$ head -30 ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
head: cannot open ‘./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers’ for reading: No such file or directory
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:009c4c44
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@wesleywiser wesleywiser force-pushed the faster_large_constant_arrays branch from 7c64a63 to 46512e0 Compare July 1, 2018 14:32
@wesleywiser

Copy link
Copy Markdown
Member Author

Fixed tidy

@oli-obk

oli-obk commented Jul 1, 2018

Copy link
Copy Markdown
Contributor

@bors r+

We should probably add a bunch of tests to the perf tests to ensure this doesn't regress.

@bors

bors commented Jul 1, 2018

Copy link
Copy Markdown
Collaborator

📌 Commit 46512e0 has been approved by oli-obk

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 1, 2018
@bors

bors commented Jul 1, 2018

Copy link
Copy Markdown
Collaborator

⌛ Testing commit 46512e0 with merge a2be769...

bors added a commit that referenced this pull request Jul 1, 2018
…i-obk

Speed up compilation of large constant arrays

This is a different approach to #51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.

With this change, the [toy program](https://github.com/rust-lang/rust/blob/c2f4744d2db4e162df824d0bd0b093ba4b351545/src/test/run-pass/mir_heavy_promoted.rs) goes from 63 seconds to 19 seconds on my machine.

Edit: Inlining `Size::bytes()` saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

Edit2: Now down to 2.8 seconds.

r? @oli-obk

cc @nnethercote @eddyb
@bors

bors commented Jul 1, 2018

Copy link
Copy Markdown
Collaborator

☀️ Test successful - status-appveyor, status-travis
Approved by: oli-obk
Pushing a2be769 to master...

@bors bors merged commit 46512e0 into rust-lang:master Jul 1, 2018
@michaelwoerister

Copy link
Copy Markdown
Member

@michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Thanks for giving it a try, @wesleywiser. Great work!

wesleywiser added a commit that referenced this pull request Oct 3, 2018
In  #51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).
kennytm added a commit to kennytm/rust that referenced this pull request Oct 3, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
emilyalbini added a commit to emilyalbini/rust that referenced this pull request Oct 3, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
emilyalbini added a commit to emilyalbini/rust that referenced this pull request Oct 4, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants