[wasm] EmccCompile: Improve AOT time by better utilizing the cores #67195

Merged
radical merged 2 commits into dotnet:main from radical:fix-emcc-compile on Mar 28, 2022
Member

radical commented Mar 27, 2022

Problem:

The EmccCompile task compiles .bc files to .o files, and uses
Parallel.ForEach to run emcc for these in parallel.

The problem manifests when EmccCompile is compiling a lot of files.

  • To start with, the intended number of cores is being used,
  • but at some point (in my case after ~150 out of 180 files), the number
    of cores being utilized goes down to 1.
  • The reason is that Parallel.ForEach partitions the list of
    files (jobs), and each partition's worker executes only the jobs
    assigned to it.

From: #46146 (comment)

Stephen Toub:
    "As such, by default ForEach works on a scheme whereby each
    thread takes one item each time it goes back to the enumerator,
    and then after a few times of this upgrades to taking two items
    each time it goes back to the enumerator, and then four, and
    then eight, and so on. This amortizes the cost of taking and
    releasing the lock across multiple items, while still enabling
    parallelization for enumerables containing just a few items. It
    does, however, mean that if you've got a case where the body
    takes a really long time and the work for every item is
    heterogeneous, you can end up with an imbalance."

The above means that with wildly different times taken by each job, we
can end up in this imbalance, leading to some cores being idle, while
others get reduced to running jobs sequentially.

Instead, we want to use work-stealing so jobs can be run by any partition.
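
For illustration only, here is a minimal sketch of one way to get that behaviour with the TPL. It is not the actual diff from this PR, and it assumes the fix amounts to handing `Parallel.ForEach` a partitioner created with `EnumerablePartitionerOptions.NoBuffering`, so that each worker takes a single item at a time rather than a growing chunk. The names `WorkStealingSketch`, `CompileAll`, and `compileOne` are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class WorkStealingSketch
{
    // Hypothetical helper, not the real EmccCompile code: runs `compileOne`
    // for every file, giving the next file to whichever worker is free.
    public static void CompileAll(
        IEnumerable<string> files,
        int maxParallelism,
        Action<string> compileOne)
    {
        // NoBuffering makes the partitioner hand out items one at a time
        // instead of in chunks, so a slow item cannot strand other items
        // that were reserved by the same worker.
        var partitioner = Partitioner.Create(
            files, EnumerablePartitionerOptions.NoBuffering);

        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxParallelism
        };

        Parallel.ForEach(partitioner, options, compileOne);
    }
}
```

The trade-off is the one described in the quote above: taking one item per trip means more synchronization on the shared enumerator, which should be negligible when each item launches an external process such as emcc.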

In my highly unscientific testing, with AOT for System.Buffers.Tests,
the total time to run EmccCompile for 181 assemblies goes from 5.7 min
to 4.0 min.

radical added the arch-wasm (WebAssembly architecture) and area-Build-mono labels Mar 27, 2022
ghost assigned radical Mar 27, 2022
ghost commented Mar 27, 2022

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: radical
Assignees: -
Labels: arch-wasm, area-Build-mono

Milestone: -

Member Author

radical commented Mar 27, 2022

/azp run runtime-wasm

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

radical removed the request for review from stephentoub March 27, 2022 00:12
Member

akoeplinger left a comment

LGTM! Should we do the same for the Parallel.ForEach in MonoAOTCompiler.cs? It could run into the same problem, I think.

Member Author

radical commented Mar 28, 2022

> LGTM! Should we do the same for the Parallel.ForEach in MonoAOTCompiler.cs? It could run into the same problem, I think.

I was thinking about that, but mono-aot-cross seems to be very quick. It's still launching a process, though, so maybe it will make sense there too. I can add it in a follow-up PR.

MonoAOTCompiler.cs: Ensure that the parallel jobs get scheduled with work-stealing, instead of being partitioned.

Member Author

radical commented Mar 28, 2022

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member Author

radical commented Mar 28, 2022

The unrelated libraries test failures on runtime are #60962.

radical merged commit d69bd06 into dotnet:main Mar 28, 2022
radical deleted the fix-emcc-compile branch March 28, 2022 19:48
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
…otnet#67195)

* [wasm] EmccCompile: Improve AOT time by better utilizing the cores

* MonoAOTCompiler.cs: Ensure that the parallel jobs get scheduled with work-stealing, instead of being partitioned.

ghost locked as resolved and limited conversation to collaborators Apr 28, 2022
3 participants