Hi!
I'm using MUMPS with OpenBLAS 0.2.18, compiled from source with these options:
- gcc 5.3.0
- USE_THREAD = 1
- NUM_THREADS = 128
- NO_WARMUP = 1
- #NO_AFFINITY = 1
- #BIGNUMA = 1
- MAX_STACK_ALLOC = 8128
I'm running my code on several multi-socket machines (2 or 4 sockets) with 24 to 128 cores.
The problem is that most of the run time is spent in system time. To be more precise, it is not doing I/O; it is doing context switches. You can see this in the pictures below.


This doesn't happen on a single-socket machine, but if I force the application onto just one CPU (taskset -c 0-31 MyAPP), the performance is also poor.
What can I do to give you more information and help track this down?
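In case numbers are more useful than pictures, here is a minimal sketch of how the context-switch counts and user/system time for one run could be collected. It is not part of my application: `measure` is a hypothetical helper name, the command in the example is only a placeholder, and it assumes a Unix-like system where Python's `resource` module reports these counters for child processes.

```python
import resource
import subprocess
import sys


def measure(cmd):
    """Run cmd to completion and return its context-switch and CPU-time totals.

    Uses RUSAGE_CHILDREN, which accumulates usage of child processes that
    have terminated and been waited for, so we take a before/after delta.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)  # blocks until the child exits
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "voluntary_switches": after.ru_nvcsw - before.ru_nvcsw,
        "involuntary_switches": after.ru_nivcsw - before.ru_nivcsw,
        "user_seconds": after.ru_utime - before.ru_utime,
        "system_seconds": after.ru_stime - before.ru_stime,
    }


if __name__ == "__main__":
    # Placeholder command (an assumption for illustration); replace it with
    # the real one, e.g. ["taskset", "-c", "0-31", "MyAPP"].
    placeholder_cmd = [sys.executable, "-c", "pass"]
    for name, value in measure(placeholder_cmd).items():
        print(f"{name}: {value}")
```

Running this once on the single-socket machine and once on a multi-socket machine would give directly comparable figures for system time and context switches.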