[wasm] EmccCompile: Improve AOT time by better utilizing the cores #67195
radical merged 2 commits into dotnet:main
Problem: The `EmccCompile` task compiles `.bc` files to `.o` files, and uses `Parallel.ForEach` to run `emcc` for these in parallel. The problem manifests when `EmccCompile` is compiling a lot of files.

- To start with, the intended number of cores is being used
- but at some point (in my case after ~150 out of 180 files), the number of cores being utilized goes down to 1.
- And the reason is that `Parallel.ForEach` partitions the list of files (jobs), and each partition executes only its assigned jobs

From: dotnet#46146 (comment)

Stephen Toub: "As such, by default ForEach works on a scheme whereby each thread takes one item each time it goes back to the enumerator, and then after a few times of this upgrades to taking two items each time it goes back to the enumerator, and then four, and then eight, and so on. This amortizes the cost of taking and releasing the lock across multiple items, while still enabling parallelization for enumerables containing just a few items. It does, however, mean that if you've got a case where the body takes a really long time and the work for every item is heterogeneous, you can end up with an imbalance."

The above means that with wildly different times taken by each job, we can end up in this imbalance, leading to some cores being idle, while others get reduced to running jobs sequentially.

Instead, we want to use work-stealing so jobs can be run by any partition.

In my highly unscientific testing, with AOT for `System.Buffers.Tests`, the total time to run `EmccCompile` for 181 assemblies goes from 5.7 minutes to 4.0 minutes.
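The imbalance described above can be reproduced with a toy simulation. This is a minimal sketch, not the .NET scheduler: it assumes chunk sizes double after every grab (a simplification of the "one, then two, then four" scheme in the quote), and the job durations, worker count, and the `makespan` helper are all made up for illustration.

```python
import heapq

def makespan(durations, workers, chunk_size):
    """Return the time at which the last job finishes.

    Workers pull consecutive chunks of jobs from one shared list.
    chunk_size(g) gives the size of a worker's next chunk, where g is
    the number of chunks that worker has already taken.
    """
    # Heap of (time the worker becomes idle, worker id).
    free_at = [(0, w) for w in range(workers)]
    heapq.heapify(free_at)
    grabs = [0] * workers
    next_job = 0
    finish = 0
    while next_job < len(durations):
        now, w = heapq.heappop(free_at)      # earliest idle worker goes next
        size = chunk_size(grabs[w])
        grabs[w] += 1
        chunk = durations[next_job:next_job + size]
        next_job += size
        done = now + sum(chunk)              # a chunk runs sequentially on its worker
        finish = max(finish, done)
        heapq.heappush(free_at, (done, w))
    return finish

# Hypothetical durations: six quick jobs, four slow ones, four quick ones.
DURATIONS = [1] * 6 + [10] * 4 + [1] * 4

growing = makespan(DURATIONS, 2, lambda g: 2 ** g)  # chunks of 1, 2, 4, ...
one_by_one = makespan(DURATIONS, 2, lambda g: 1)    # always a single job

print(growing)     # 43: one worker ends up holding all four slow jobs
print(one_by_one)  # 25: the slow jobs are spread over both workers
```

In the growing-chunk run the second worker finishes at time 7 and then sits idle, which is the "cores utilized goes down to 1" symptom in miniature. In .NET, passing `Partitioner.Create(source, EnumerablePartitionerOptions.NoBuffering)` to `Parallel.ForEach` is one documented way to have threads take a single item at a time; the diff is not shown on this page, so treat that as a pointer rather than a description of the change.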
Tagging subscribers to 'arch-wasm': @lewing
/azp run runtime-wasm
Azure Pipelines successfully started running 1 pipeline(s).
akoeplinger left a comment:
LGTM! Should we do the same for the `Parallel.ForEach` in `MonoAOTCompiler.cs`? It could run into the same problem, I think.
I was thinking about that, but |
.. work-stealing, instead of being partitioned.
/azp run runtime-extra-platforms
Azure Pipelines successfully started running 1 pipeline(s).
The unrelated libraries test failures on |
…otnet#67195)
* [wasm] EmccCompile: Improve AOT time by better utilizing the cores
* MonoAOTCompiler.cs: Ensure that the parallel jobs get scheduled with work-stealing, instead of being partitioned.