Replace sched_yield with usleep(1) on Linux #1051
|
@martin-frbg Did you measure the overhead of sched_yield and usleep(1)? |
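One way to put numbers on that question is a small per-call timing loop. The following is only a sketch, not code from OpenBLAS: the helper names (`now_ns`, `avg_call_ns`, `yield_sched`, `yield_sleep`, `yield_empty`) are made up for illustration, and it measures an uncontended single thread, which may behave quite differently from a loaded BLAS thread pool.

```c
#define _DEFAULT_SOURCE          /* assumption: glibc/musl; exposes usleep() */
#include <assert.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: current time on the monotonic clock, in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

typedef void (*yield_fn)(void);

/* The three candidate bodies for YIELDING discussed in this thread. */
static void yield_sched(void) { sched_yield(); }
static void yield_sleep(void) { usleep(1); }
static void yield_empty(void) { }

/* Average wall-clock cost of one call to fn, in nanoseconds. */
static double avg_call_ns(yield_fn fn, int iters) {
    assert(iters > 0);
    double t0 = now_ns();
    for (int i = 0; i < iters; i++)
        fn();
    return (now_ns() - t0) / iters;
}
```

Calling `avg_call_ns(yield_sched, 100000)` and `avg_call_ns(yield_sleep, 1000)` from a small `main` and printing the results gives a rough per-call cost for each variant. Wall-clock cost alone does not show whether the time was spent in the kernel or asleep, so it is best read together with user/system CPU times.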
|
Unfortunately I do not have access to any serious HPC system (and I have to agree the two issue threads I referred to are a bit confusing, with attempts to modify the GEMM thresholds intermixed with the changes to YIELDING). On my lowly dual-core Kaby Lake laptop I do see a significant increase in throughput with the (improved) deig.R benchmark (2k x 2k going from 8000 Mflops/28 sec to 24000 Mflops/9 sec), though most of that will probably be due to avoiding thermal throttling. |
|
@xianyi just start idle CPU hog e.g.: |
|
I'll see if I can retask at least a 4-core Haswell to benchmarking tomorrow. |
|
First results (still with deig.R, on a quad-core Haswell with turboboost disabled to ensure repeatability) show a slight slowdown rather than the 10 percent speedup reported for fenrus75's test case. The change does, however, allow the cores to drop to "idle" intermittently, whereas with sched_yield the core allocation was split between userspace and system time. This is probably what helps thermal management and would make the calculation run faster with dynamic overclocking enabled. |
|
for me YIELDING {} gave best result with 0.1um speedup |
|
Interestingly an empty YIELDING is what was used in early versions of libGoto2 up to at least 1.08, while 1.13 had sched_yield. |
|
Some preliminary data (more to come next weekend if time and workload permits). deig.R benchmark |
|
you can cat() whole z: z[1] would be user time (that makes flops) and z[2] system time (that generates heat) |
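The same user/system split can be observed from C with `getrusage`. This is a minimal sketch under the assumption of a POSIX system; `struct cpu_split`, `cpu_time_now` and `measure_yield_spin` are illustrative names, not part of OpenBLAS or the benchmark.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical container for the two CPU-time components, in seconds. */
struct cpu_split {
    double user_s;  /* time spent in userspace (the part that makes flops) */
    double sys_s;   /* time spent in the kernel (e.g. inside sched_yield) */
};

static double tv_to_s(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* CPU time consumed so far by the whole process. */
static struct cpu_split cpu_time_now(void) {
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    struct cpu_split s = { tv_to_s(ru.ru_utime), tv_to_s(ru.ru_stime) };
    return s;
}

/* CPU time consumed by n back-to-back sched_yield() calls. */
static struct cpu_split measure_yield_spin(long n) {
    struct cpu_split before = cpu_time_now();
    for (long i = 0; i < n; i++)
        sched_yield();
    struct cpu_split after = cpu_time_now();
    struct cpu_split d = { after.user_s - before.user_s,
                           after.sys_s - before.sys_s };
    return d;
}
```

Comparing the `sys_s` share of a `sched_yield` spin with the share of the same loop using `usleep(1)` is one way to check the "split between userspace and system time" observation on a given machine.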
|
From sched_yield(2) manual page: |
|
Yeah, we can all read manpages - my problem is when would a call to sched_yield clearly be considered unnecessary, inappropriate and indecent, and why would K. Goto and all who came after him decide this was not the case here? (Note libGoto2-1.13 was released well after the Linux kernel got the current sched_yield semantics - of course it could be that the man was using some flavor of *BSD or something else with lightweight sched_yield at the time) |
|
PR updated to current favorite just to prevent accidental merging of known bad idea. |
|
Did not get around to much more yet, so no complete picture. Quick comparison of deig.R run with 4 threads and 10240x10240 matrix, i7-4770 with and without turboboost enabled: |
|
Seems that slack turns up for small samples more without impacting overall time. |
|
Just remember that |
|
Thanks for that comment - indeed I am still looking for reasons why sched_yield was preferred so far. If I am not mistaken, a default build of OpenBLAS will have a built-in thread limit of two times the number of cores, on x86 at least, and so far the behaviour of a "lightly" overloaded system without sched_yield did not appear to be worse. (And I did see a clear improvement in terms of thermal management with my - apparently poorly designed - early Kaby Lake laptop.) So perhaps it would be sufficient to provide a choice of implementations for "YIELDING" in Makefile.rule, or to couple it to the NUM_THREADS value in relation to the physical cores detected? (Lastly, there is always the option to limit the number of threads to some manageable quantity for the given system at runtime.) |
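A build-time choice of implementations could look roughly like the sketch below. `YIELD_MODE` is a hypothetical knob invented for this illustration (it is not an existing OpenBLAS or Makefile.rule option), and `wait_for_flag` is a stand-in for a worker's polling loop rather than the library's actual code.

```c
#define _DEFAULT_SOURCE          /* assumption: glibc/musl; exposes usleep() */
#include <assert.h>
#include <stddef.h>
#include <sched.h>
#include <unistd.h>

/* Hypothetical selector: 0 = empty, 1 = sched_yield, 2 = usleep(1).
 * It could be set from the build system, e.g. with -DYIELD_MODE=1. */
#ifndef YIELD_MODE
#define YIELD_MODE 2
#endif

#if YIELD_MODE == 0
#define YIELDING do { } while (0)   /* busy-wait, as in early libGoto2 */
#elif YIELD_MODE == 1
#define YIELDING sched_yield()      /* hand the core back to the scheduler */
#else
#define YIELDING usleep(1)          /* short sleep, lets the core go idle */
#endif

/* Poll *flag up to max_spins times, yielding between polls.
 * Returns 1 if the flag was seen set, 0 otherwise.
 * volatile keeps the sketch short; real code would use atomics. */
static int wait_for_flag(volatile int *flag, long max_spins) {
    assert(flag != NULL);
    for (long i = 0; i < max_spins; i++) {
        if (*flag)
            return 1;
        YIELDING;
    }
    return *flag != 0;
}
```

Keeping the selection in one macro means the three variants can be benchmarked against each other without touching the call sites.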
|
Closing this for now as the results so far are a bit inconclusive and I need to get rid of my current fork as doing subsequent PRs from it seems to have led to an unintended revert of #988 |
to avoid the massive overhead of the sched_yield call on Linux kernels since its semantics were changed in early 2003 (late 2.5 series) to include reordering of the thread queue. Ref. #900, #923