Replace sched_yield with usleep(1) on Linux by martin-frbg · Pull Request #1051 · OpenMathLib/OpenBLAS

martin-frbg · 2017-01-08T22:07:19Z

to avoid the massive overhead of the sched_yield call on Linux kernels since its semantics were changed in early 2003 (late 2.5 series) to include reordering of the thread queue. Ref. #900,#923
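For readers unfamiliar with how YIELDING is used, the following is a minimal sketch of a spin-wait loop whose yield primitive is chosen at compile time. The `YIELD_MODE` selector, `work_ready`, `wait_for_work` and `run_spin_wait_demo` are hypothetical names made up for this illustration and are not OpenBLAS build options or functions; only the four alternatives themselves (sched_yield, usleep(1), a run of nops, nothing) come from this thread.

```c
#define _DEFAULT_SOURCE            /* for usleep() under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Hypothetical selector (not an OpenBLAS option): build with -DYIELD_MODE=n. */
#ifndef YIELD_MODE
#define YIELD_MODE 0
#endif

#if YIELD_MODE == 0
#define YIELDING sched_yield()                 /* behaviour the PR wants to replace */
#elif YIELD_MODE == 1
#define YIELDING usleep(1)                     /* first commit of this PR */
#elif YIELD_MODE == 2
#define YIELDING __asm__ __volatile__( \
    "nop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop\n\tnop")  /* "8xnop" */
#else
#define YIELDING do { } while (0)              /* "nothing" */
#endif

static atomic_int work_ready;

/* Producer thread: publishes work by setting the flag. */
static void *producer(void *arg) {
    (void)arg;
    atomic_store(&work_ready, 1);
    return NULL;
}

/* Spin until the flag is set, invoking YIELDING on every iteration. */
static unsigned long wait_for_work(void) {
    unsigned long spins = 0;
    while (!atomic_load(&work_ready)) {
        YIELDING;
        spins++;
    }
    return spins;
}

/* Start the producer, spin-wait for it, and return the final flag value. */
int run_spin_wait_demo(void) {
    pthread_t t;
    atomic_store(&work_ready, 0);
    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);
    (void)wait_for_work();
    pthread_join(t, NULL);
    return atomic_load(&work_ready);
}
```

Building the same file with different `-DYIELD_MODE` values (and `-pthread`) gives one binary per alternative, which is roughly the kind of comparison discussed in the comments below.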

xianyi · 2017-01-09T08:20:39Z

@martin-frbg Do you measure the overhead of sched_yield or usleep(1) ?

martin-frbg · 2017-01-09T13:31:29Z

Unfortunately I do not have access to any serious HPC system (and I have to agree the two issue threads I referred to are a bit confusing, with attempts to modify the GEMM thresholds intermixed with the changes to YIELDING). On my lowly dual-core Kaby Lake laptop I do see a significant increase in throughput with the (improved) deig.R benchmark (2k x 2k going from 8000 MFlops/28 sec to 24000 MFlops/9 sec), though most of that will probably be due to avoiding thermal throttling.
Note also that wernsaar already replaced sched_yield with asm(nop) for a number of AMD CPUs in the past, while I do not think the underlying issue is in CPU hardware. Should I open a separate issue to discuss?

brada4 · 2017-01-09T15:51:37Z

@xianyi just start idle CPU hog e.g.:
nice yes > /dev/null &
Then see how much spare CPU time is left over when running the same sample with different YIELDING options.
That shows either more CPU available to competing users or, in the absence of any, less heat generated for nothing.
I don't get a significant improvement in wall time, but system CPU time consumption drops noticeably. Somebody with a watt-meter would be better positioned to measure this.
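The user/system split mentioned here can also be observed from inside a C program. The sketch below is an illustration only, assuming a Linux/POSIX system with `getrusage`; the `cpu_split` type and `measure_sched_yield` function are hypothetical names and not part of OpenBLAS or of the benchmark used in this thread.

```c
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/time.h>

/* CPU time consumed by the calling process, split by where it was spent. */
typedef struct {
    double user_sec;   /* time in user space */
    double sys_sec;    /* time in the kernel on behalf of the process */
} cpu_split;

static double tv_to_sec(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Call sched_yield() `calls` times and report the CPU time this cost. */
cpu_split measure_sched_yield(long calls) {
    struct rusage before, after;
    cpu_split out;

    assert(calls >= 0);
    getrusage(RUSAGE_SELF, &before);
    for (long i = 0; i < calls; i++) {
        sched_yield();
    }
    getrusage(RUSAGE_SELF, &after);

    out.user_sec = tv_to_sec(after.ru_utime) - tv_to_sec(before.ru_utime);
    out.sys_sec  = tv_to_sec(after.ru_stime) - tv_to_sec(before.ru_stime);
    return out;
}
```

Printing both fields for a large `calls` value, and repeating the run with the loop body swapped for another YIELDING candidate, gives a rough picture of how much of the waiting cost lands in system time. The absolute numbers depend on kernel version and load, so none are quoted here.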

martin-frbg · 2017-01-09T22:36:21Z

I'll see if I can retask at least a 4-core Haswell to benchmarking tomorrow.

martin-frbg · 2017-01-10T16:50:01Z

First results (still with deig.R, on a quad-core Haswell with turboboost disabled to ensure repeatability) show a slight slowdown rather than the 10 percent speedup reported for fenrus75's test case. The change does, however, allow the cores to drop to "idle" intermittently, whereas with sched_yield the core allocation was split between userspace and system time. This is probably what helps thermal management and would make the calculation run faster with dynamic overclocking enabled.

brada4 · 2017-01-10T20:44:16Z

for me YIELDING {} gave best result with 0.1um speedup

martin-frbg · 2017-01-10T21:14:43Z

Interestingly an empty YIELDING is what was used in early versions of libGoto2 up to at least 1.08, while 1.13 had sched_yield.

martin-frbg · 2017-01-11T19:10:52Z

Some preliminary data (more to come next weekend if time and workload permit). deig.R benchmark modified for matrix size 10240x10240, four threads on an otherwise idle quad-core Haswell 3.4GHz w/o turboboost, YIELDING defined as:

YIELDING      MFlops     Time (sec)
sched_yield   62626.0    457.1
usleep(1)     56996.8    502.2
nothing       62442.8    458.4
8xnop         62759.1    456.1

So usleep(1) appears to be significantly slower; the other alternatives are basically on par in terms of speed, but as mentioned above sched_yield keeps the cores from going idle. Results for 2 and 8 threads are similar to this (though I only have them for small matrix sizes up to 2048x2048).
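To put "significantly slower" and "basically on par" into numbers, the relative differences can be computed from the figures above. The helper below is just a sketch for that arithmetic; `percent_change` is a hypothetical name.

```c
#include <assert.h>

/* Relative difference of `value` against `baseline`, in percent.
   Positive means `value` is larger than `baseline`. */
double percent_change(double baseline, double value) {
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

With sched_yield as the baseline, `percent_change(457.1, 502.2)` gives roughly +9.9 percent run time for usleep(1), and `percent_change(62626.0, 56996.8)` roughly -9.0 percent MFlops. For 8xnop, `percent_change(457.1, 456.1)` is about -0.2 percent, which is likely within run-to-run noise for a single measurement.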

brada4 · 2017-01-11T20:06:24Z

you can cat() the whole z: z[1] would be user time (that makes flops) and z[2] system time (that generates heat)
2k x 2k x 8 B = 32 MB, probably don't try at NUMA...
I think the actual purpose of sched_yield is to make cores idle, but I am not sure.
My Google research also shows that heavy sched_yield is a product of kernel 2.6, which is the oldest in the wild for all practical purposes.

brada4 · 2017-01-11T20:24:46Z

From sched_yield(2) manual page:
Strategic calls to sched_yield() can improve performance by giving other threads or processes a chance to run when (heavily) contended resources (e.g., mutexes) have been released by the caller. Avoid calling sched_yield() unnecessarily or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.

martin-frbg · 2017-01-11T21:40:17Z

Yeah we can all read manpages - my problem is when would a call to sched_yield clearly be considered unnecessary, inappropriate and indecent, and why would K. Goto and all who came after him decide this was not the case here? (Note libGoto2-1.13 was released well after the Linux kernel got the current sched_yield semantics - of course it could be that the man was using some flavor of *BSD or something else with lightweight sched_yield at the time.)

martin-frbg · 2017-01-11T22:42:06Z

PR updated to current favorite just to prevent accidental merging of known bad idea.

martin-frbg · 2017-01-17T15:13:24Z

Did not get around to much more yet, so no complete picture. Quick comparison of deig.R run with 4 threads and 10240x10240 matrix, i7-4770 with and without turboboost enabled:

YIELDING      fixed freq               turboboost
sched_yield   62625 MFlops, 457 sec    64515 MFlops, 443 sec
asm(8xnop)    62759 MFlops, 456 sec    65407 MFlops, 437 sec

brada4 · 2017-01-17T21:22:59Z

Seems that slack turns up for small samples more without impacting overall time.

jeffhammond · 2017-01-19T21:07:09Z

Just remember that sched_yield has a significant upside over nop when oversubscribing. It would be a lot better to allow users to select the more forgiving option if they intend to run in a multi-tenant or desktop environment.

martin-frbg · 2017-01-19T21:46:03Z

Thanks for that comment - indeed I am still looking for reasons why sched_yield was preferred so far. If I am not mistaken, a default build of OpenBLAS will have a built-in thread limit of two times the number of cores (on x86 at least), and so far the behaviour of a "lightly" overloaded system without sched_yield did not appear to be worse. (And I did see a clear improvement in terms of thermal management with my - apparently poorly designed - early Kaby Lake laptop.) So perhaps it would be sufficient to provide a choice of implementations for "YIELDING" in Makefile.rule, or couple it to the NUM_THREADS value in relation to physical cores detected? (Lastly there is always the option to limit the number of threads to some manageable quantity for the given system at runtime.)
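One way the "couple it to the thread count" idea could look is sketched below. This is an assumption-laden illustration, not a proposal for the actual OpenBLAS code: `yield_kind`, `choose_yield`, `online_cores` and `do_yield` are hypothetical names, the "more threads than cores" rule is only one possible policy, and `sysconf(_SC_NPROCESSORS_ONLN)` reports online logical CPUs rather than physical cores.

```c
#include <assert.h>
#include <sched.h>
#include <unistd.h>

typedef enum { YIELD_NOP, YIELD_SCHED } yield_kind;

/* Hypothetical policy: hand the CPU back to the scheduler only when
   there are more threads than cores (oversubscription). */
yield_kind choose_yield(long num_threads, long num_cores) {
    assert(num_threads > 0 && num_cores > 0);
    return num_threads > num_cores ? YIELD_SCHED : YIELD_NOP;
}

/* Online logical CPUs as seen by the OS (not physical cores);
   falls back to 1 if the query fails. */
long online_cores(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Perform one yield step according to the chosen strategy. */
void do_yield(yield_kind k) {
    if (k == YIELD_SCHED) {
        sched_yield();
    } else {
        __asm__ __volatile__("nop");
    }
}
```

A caller would evaluate `choose_yield(num_threads, online_cores())` once at thread-pool start-up and pass the result to `do_yield` inside the spin loop. The extra branch per iteration is a cost that a compile-time choice in Makefile.rule would avoid.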

martin-frbg · 2017-02-11T14:50:21Z

Closing this for now as the results so far are a bit inconclusive and I need to get rid of my current fork as doing subsequent PRs from it seems to have led to an unintended revert of #988

Replace sched_yield with usleep(1) on Linux

f1db56f

to avoid the massive overhead of the sched_yield call on Linux kernels since its semantics were changed in early 2003 (late 2.5 series) to include reordering of the thread queue. Ref. #900,#923

Replace sched_yield on Linux with nop instruction

fb31c81

martin-frbg closed this Feb 11, 2017

jeffhammond mentioned this pull request Feb 20, 2017

Add thread yield function for spin-loops. flame/blis#82

Closed

brada4 mentioned this pull request May 19, 2017

Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8 #1071

Closed

brada4 mentioned this pull request Jun 6, 2017

Problem with OpenBlas and Openmp #1193

Closed

martin-frbg mentioned this pull request Jun 11, 2018

performance in sched_yield (multisocket) #900

Closed

martin-frbg mentioned this pull request Jan 26, 2024

Inquiry and Suggestions Regarding OpenBLAS Code Flow with OpenMP #4418

Open

4 participants
