Commit d347058

radical authored and radekdoulik committed
[wasm] EmccCompile: Improve AOT time by better utilizing the cores (dotnet#67195)
* [wasm] EmccCompile: Improve AOT time by better utilizing the cores

Problem: The `EmccCompile` task compiles `.bc` files to `.o` files, and uses `Parallel.ForEach` to run `emcc` for these in parallel. The problem manifests when `EmccCompile` is compiling a lot of files.

- To start with, the intended number of cores is used, but at some point (in my case after ~150 out of 180 files) the number of cores being utilized goes down to 1.
- The reason is that `Parallel.ForEach` partitions the list of files (jobs), and each partition executes only the jobs assigned to it.

From: dotnet#46146 (comment)

Stephen Toub: "As such, by default ForEach works on a scheme whereby each thread takes one item each time it goes back to the enumerator, and then after a few times of this upgrades to taking two items each time it goes back to the enumerator, and then four, and then eight, and so on. This amortizes the cost of taking and releasing the lock across multiple items, while still enabling parallelization for enumerables containing just a few items. It does, however, mean that if you've got a case where the body takes a really long time and the work for every item is heterogeneous, you can end up with an imbalance."

The above means that when the time taken by each job varies wildly, we can end up with this imbalance: some cores sit idle while others are reduced to running their remaining jobs sequentially.

Instead, we want to use work-stealing so jobs can be run by any partition.

In my highly unscientific testing, with AOT for `System.Buffers.Tests`, the total time to run `EmccCompile` for 181 assemblies goes from 5.7 mins to 4.0 mins.

* MonoAOTCompiler.cs: Ensure that the parallel jobs get scheduled with work-stealing, instead of being partitioned.
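The effect described in the commit message can be tried outside the build with a small program. The following is only a sketch under stated assumptions: `NoBufferingSketch`, `JobDurationsMs`, and `Run` are hypothetical names, `Thread.Sleep` stands in for an `emcc` invocation, and the job durations are invented. The source is a lazily generated sequence so that the enumerator-based chunking from the quoted comment is the path being exercised. Timings will vary by machine and run, so no particular numbers are claimed.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class NoBufferingSketch
{
    // Hypothetical workload: mostly short jobs with an occasional long one,
    // standing in for compile jobs of very different durations.
    private static IEnumerable<int> JobDurationsMs(int count)
    {
        for (int i = 0; i < count; i++)
            yield return (i % 16 == 0) ? 200 : 10;
    }

    // Runs all jobs and returns how many were processed and how long it took.
    public static (int Processed, TimeSpan Elapsed) Run(
        bool noBuffering, int count = 64, int workers = 4)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = workers };
        int processed = 0;
        Action<int> job = ms =>
        {
            Thread.Sleep(ms);                     // simulated work
            Interlocked.Increment(ref processed); // thread-safe counter
        };

        var stopwatch = Stopwatch.StartNew();
        if (noBuffering)
        {
            // Same shape as the change in this commit: each worker takes
            // one item at a time from the shared enumerator.
            Parallel.ForEach(
                Partitioner.Create(JobDurationsMs(count), EnumerablePartitionerOptions.NoBuffering),
                options,
                job);
        }
        else
        {
            // Default partitioning for an enumerable: workers take chunks
            // whose size grows over time.
            Parallel.ForEach(JobDurationsMs(count), options, job);
        }
        stopwatch.Stop();
        return (processed, stopwatch.Elapsed);
    }

    public static void Main()
    {
        Console.WriteLine($"default:     {Run(noBuffering: false)}");
        Console.WriteLine($"NoBuffering: {Run(noBuffering: true)}");
    }
}
```

Both variants process every job; the only difference is how items are handed to workers, which is why the swap in this commit is limited to the first argument of `Parallel.ForEach`.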
1 parent fc4d4ea commit d347058

2 files changed

Lines changed: 55 additions & 2 deletions

src/tasks/AotCompilerTask/MonoAOTCompiler.cs

Lines changed: 27 additions & 1 deletion
@@ -419,8 +419,34 @@ private bool ExecuteInternal()
             if (BuildEngine is IBuildEngine9 be9)
                 allowedParallelism = be9.RequestCores(allowedParallelism);

+            /*
+                From: https://github.com/dotnet/runtime/issues/46146#issuecomment-754021690
+
+                Stephen Toub:
+                "As such, by default ForEach works on a scheme whereby each
+                thread takes one item each time it goes back to the enumerator,
+                and then after a few times of this upgrades to taking two items
+                each time it goes back to the enumerator, and then four, and
+                then eight, and so on. This amortizes the cost of taking and
+                releasing the lock across multiple items, while still enabling
+                parallelization for enumerables containing just a few items. It
+                does, however, mean that if you've got a case where the body
+                takes a really long time and the work for every item is
+                heterogeneous, you can end up with an imbalance."
+
+                The time taken by individual compile jobs here can vary a
+                lot, depending on various factors like file size. This can
+                create an imbalance, like mentioned above, and we can end up
+                in a situation where one of the partitions has a job that
+                takes very long to execute, by which time other partitions
+                have completed, so some cores are idle. But the idle
+                ones won't get any of the remaining jobs, because they are
+                all assigned to that one partition.
+
+                Instead, we want to use work-stealing so jobs can be run by any partition.
+            */
             ParallelLoopResult result = Parallel.ForEach(
-                argsList,
+                Partitioner.Create(argsList, EnumerablePartitionerOptions.NoBuffering),
                 new ParallelOptions { MaxDegreeOfParallelism = allowedParallelism },
                 (args, state) => PrecompileLibraryParallel(args, state));
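The read-ahead behaviour that the comment above describes can be observed with an instrumented enumerator. This is a rough sketch, not part of the commit: `ReadAheadProbe` and `MaxReadAhead` are hypothetical names, and the 5 ms sleep is an arbitrary stand-in for work. It records the largest gap between items pulled from the enumerator and items finished. With `NoBuffering`, each worker should hold at most one unfinished item, so the gap should not exceed the worker count; with the default partitioner the gap would be expected to grow as chunk sizes increase.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class ReadAheadProbe
{
    private readonly object _gate = new object();
    private int _yielded;    // items handed out by the enumerator
    private int _completed;  // items whose loop body has finished
    private int _maxLead;    // largest observed (yielded - completed)

    // Lazily yields items and records how far ahead of the completed
    // work the enumerator has been read.
    private IEnumerable<int> Source(int count)
    {
        for (int i = 0; i < count; i++)
        {
            int lead = Interlocked.Increment(ref _yielded) - Volatile.Read(ref _completed);
            lock (_gate)
            {
                if (lead > _maxLead) _maxLead = lead;
            }
            yield return i;
        }
    }

    // Runs the loop with either the default or the NoBuffering partitioner
    // and returns the largest read-ahead that was observed.
    public static int MaxReadAhead(bool noBuffering, int count = 200, int workers = 4)
    {
        var probe = new ReadAheadProbe();
        var options = new ParallelOptions { MaxDegreeOfParallelism = workers };
        IEnumerable<int> source = probe.Source(count);
        Action<int> body = _ =>
        {
            Thread.Sleep(5); // simulated work
            Interlocked.Increment(ref probe._completed);
        };

        if (noBuffering)
        {
            Parallel.ForEach(
                Partitioner.Create(source, EnumerablePartitionerOptions.NoBuffering),
                options,
                body);
        }
        else
        {
            Parallel.ForEach(source, options, body);
        }
        return probe._maxLead;
    }
}
```

Calling `ReadAheadProbe.MaxReadAhead(noBuffering: false)` and `ReadAheadProbe.MaxReadAhead(noBuffering: true)` side by side gives a quick way to compare how many items each scheme claims before they are actually processed.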

src/tasks/WasmAppBuilder/EmccCompile.cs

Lines changed: 28 additions & 1 deletion
@@ -129,7 +129,34 @@ private bool ExecuteActual()
                 if (BuildEngine is IBuildEngine9 be9)
                     allowedParallelism = be9.RequestCores(allowedParallelism);

-                ParallelLoopResult result = Parallel.ForEach(filesToCompile,
+                /*
+                    From: https://github.com/dotnet/runtime/issues/46146#issuecomment-754021690
+
+                    Stephen Toub:
+                    "As such, by default ForEach works on a scheme whereby each
+                    thread takes one item each time it goes back to the enumerator,
+                    and then after a few times of this upgrades to taking two items
+                    each time it goes back to the enumerator, and then four, and
+                    then eight, and so on. This amortizes the cost of taking and
+                    releasing the lock across multiple items, while still enabling
+                    parallelization for enumerables containing just a few items. It
+                    does, however, mean that if you've got a case where the body
+                    takes a really long time and the work for every item is
+                    heterogeneous, you can end up with an imbalance."
+
+                    The time taken by individual compile jobs here can vary a
+                    lot, depending on various factors like file size. This can
+                    create an imbalance, like mentioned above, and we can end up
+                    in a situation where one of the partitions has a job that
+                    takes very long to execute, by which time other partitions
+                    have completed, so some cores are idle. But the idle
+                    ones won't get any of the remaining jobs, because they are
+                    all assigned to that one partition.
+
+                    Instead, we want to use work-stealing so jobs can be run by any partition.
+                */
+                ParallelLoopResult result = Parallel.ForEach(
+                    Partitioner.Create(filesToCompile, EnumerablePartitionerOptions.NoBuffering),
                     new ParallelOptions { MaxDegreeOfParallelism = allowedParallelism },
                     (toCompile, state) =>
                     {
