While I was working on some code for big-integer accumulation on GPUs, I discovered that atomic integer operations on local memory are broken in oneAPI.jl. I've simplified the problematic code down to the following minimal reproducer:
```julia
using oneAPI
using KernelAbstractions
using Atomix: @atomic

@kernel function test_kernel!(result::AbstractVector{UInt32})
    i = @index(Global, Linear)
    g = @index(Group, Linear)
    l = @index(Local, Linear)

    # Initialize an accumulator in local memory.
    a = @localmem(UInt32, (1,))
    if isone(l)
        a[1] = 0
    end
    @synchronize()

    # Every thread atomically adds its index to the accumulator.
    @atomic a[1] += UInt32(i)
    @synchronize()

    # The first thread in each workgroup writes the sum to the output array.
    if isone(l)
        result[g] = a[1]
    end
end

function run_test(backend::Backend)
    result = KernelAbstractions.allocate(backend, UInt32, 256)
    test_kernel!(backend, 256)(result; ndrange=65536)
    return Vector{UInt32}(result)
end

const REFERENCE_VALUES = [65536 * UInt32(g) - 32640 for g = 1:256]

for trial = 1:10
    println("Trial $trial:")
    test_values = run_test(oneAPIBackend())
    for (g, (ref_val, test_val)) in enumerate(zip(REFERENCE_VALUES, test_values))
        if ref_val != test_val
            println(" Workgroup $g miscalculated: expected $ref_val, got $test_val")
        end
    end
end
```
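For readers checking the expected values: the closed form in `REFERENCE_VALUES` can be derived as follows. This sketch assumes that workgroup g (1-based, 256 threads each) covers the contiguous global linear indices 256(g-1)+1 through 256g. The symbols S_g and l are introduced here only for illustration.

```latex
\begin{aligned}
S_g &= \sum_{i=256(g-1)+1}^{256g} i
     = \sum_{l=1}^{256} \bigl(256(g-1) + l\bigr) \\
    &= 256^2\,(g-1) + \frac{256 \cdot 257}{2}
     = 65536\,(g-1) + 32896 \\
    &= 65536\,g - 32640 .
\end{aligned}
```

The largest expected sum is S_256 = 16744576, well below 2^32, so under this assumption a correct `UInt32` accumulator cannot overflow.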
This program consistently produces incorrect output in 1-3% of workgroups, with nondeterminism in which particular workgroups miscompute. A typical output on my system looks like this:
Trial 1:
Workgroup 62 miscalculated: expected 4030592, got 2764760
Workgroup 64 miscalculated: expected 4161664, got 2855384
Trial 2:
Workgroup 20 miscalculated: expected 1278080, got 553912
Workgroup 62 miscalculated: expected 4030592, got 2765272
Workgroup 64 miscalculated: expected 4161664, got 3125344
Trial 3:
Workgroup 19 miscalculated: expected 1212544, got 1136120
Workgroup 20 miscalculated: expected 1278080, got 1116016
Workgroup 23 miscalculated: expected 1474688, got 1383672
Workgroup 46 miscalculated: expected 2982016, got 2608496
Workgroup 57 miscalculated: expected 3702912, got 2774368
Workgroup 62 miscalculated: expected 4030592, got 3779064
Workgroup 63 miscalculated: expected 4096128, got 3839736
Trial 4:
Workgroup 20 miscalculated: expected 1278080, got 1199096
Workgroup 46 miscalculated: expected 2982016, got 2796536
Workgroup 59 miscalculated: expected 3833984, got 3593208
Trial 5:
Workgroup 15 miscalculated: expected 950400, got 889080
Workgroup 18 miscalculated: expected 1147008, got 1075960
Workgroup 23 miscalculated: expected 1474688, got 1289072
Workgroup 54 miscalculated: expected 3506304, got 3286264
Workgroup 57 miscalculated: expected 3702912, got 3471864
Trial 6:
Workgroup 21 miscalculated: expected 1343616, got 1258488
Workgroup 22 miscalculated: expected 1409152, got 1319416
Workgroup 24 miscalculated: expected 1540224, got 956752
Workgroup 46 miscalculated: expected 2982016, got 2793720
Workgroup 56 miscalculated: expected 3637376, got 3408120
Workgroup 58 miscalculated: expected 3768448, got 3531256
Trial 7:
Workgroup 19 miscalculated: expected 1212544, got 1063536
Workgroup 20 miscalculated: expected 1278080, got 1039592
Workgroup 21 miscalculated: expected 1343616, got 1174384
Workgroup 22 miscalculated: expected 1409152, got 1233008
Workgroup 23 miscalculated: expected 1474688, got 1380856
Workgroup 49 miscalculated: expected 3178624, got 2980856
Workgroup 52 miscalculated: expected 3375232, got 2952560
Workgroup 54 miscalculated: expected 3506304, got 3287288
Trial 8:
Workgroup 14 miscalculated: expected 884864, got 828920
Workgroup 18 miscalculated: expected 1147008, got 1075704
Workgroup 21 miscalculated: expected 1343616, got 1005152
Workgroup 49 miscalculated: expected 3178624, got 2979064
Workgroup 56 miscalculated: expected 3637376, got 3181936
Trial 9:
Workgroup 19 miscalculated: expected 1212544, got 1059440
Workgroup 20 miscalculated: expected 1278080, got 1040104
Workgroup 22 miscalculated: expected 1409152, got 1230192
Workgroup 23 miscalculated: expected 1474688, got 1383160
Workgroup 46 miscalculated: expected 2982016, got 2796024
Workgroup 61 miscalculated: expected 3965056, got 3716344
Workgroup 63 miscalculated: expected 4096128, got 3841272
Trial 10:
Workgroup 57 miscalculated: expected 3702912, got 3008488
Workgroup 62 miscalculated: expected 4030592, got 3273960
Workgroup 63 miscalculated: expected 4096128, got 3838712
Workgroup 64 miscalculated: expected 4161664, got 3901944
Details of the machine I'm running on:
I'm happy to provide further details of my setup if it would be useful for debugging. Here's what I know about the problem so far:
- `Int32` and `UInt32` both trigger the bug, so it does not appear to be related to signedness.
- `Int64` and `UInt64` also trigger the bug if you change the lines:
- `result[g] = a[1] & 0x0000FFFFFFFFFFFF` does not trigger the bug; it's the writeback to local memory that does it. Again, both `Int64` and `UInt64` are affected, so the bug appears unrelated to signedness.