Inference regression DML 1.10.1->1.11 and higher

I noticed an inference regression between DML 1.10.1 (and earlier) and DML 1.11 (and later) that causes the inference results to be completely off with some models. I'm not sure exactly which node causes the issue, but here is a complete step-by-step repro:

1. Download [model3.onnx](https://drive.google.com/file/d/1F6PprLYvFRjpFaDKEZNdrn4lKA6WUDrg/view?usp=drive_link)
2. Install ONNX Runtime for Python with DirectML (1.12) support: `pip install onnxruntime-directml`.
3. Launch Python and run the following block, which shows the results for CPU inference (ground-truth):
```
import numpy as np
import onnxruntime as ort

# CPU execution provider: its output is used as the ground truth.
session = ort.InferenceSession("model3.onnx", providers=["CPUExecutionProvider"])

# The model takes three float32 inputs of identical shape, all filled with ones.
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)

outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```
This will show the following output:
```
[[[[ 2.9624936e-01  4.4031662e-01  4.9660692e-01 ...  4.6317300e-01
     4.7017282e-01  4.8335472e-01]
   [ 1.3661800e-01  3.4651098e-01  4.5258337e-01 ...  1.9679888e-01
     1.7567760e-01  1.5068966e-01]
   [ 8.6605281e-02  2.3559587e-01  2.5511262e-01 ...  1.8966110e-01
     1.6827986e-01  1.8628305e-01]
   ...
   [ 2.0687398e-01  1.7956746e-01  1.2259285e-01 ...  4.0398946e-01
     2.9999584e-01  2.3229304e-01]
   [ 2.1055967e-01  2.8771651e-01  1.5513927e-01 ...  2.2960059e-01
     9.8949686e-02  1.3984089e-01]
   [ 3.5148939e-01  5.4730177e-01  4.9234924e-01 ...  5.0844795e-01
     1.0927881e-01  8.2973397e-01]]

  [[ 2.2016920e-03  2.0630960e-03 -5.2188965e-04 ...  2.2417619e-03
    -1.7067563e-05 -1.7323773e-03]
   [ 1.7242560e-03  1.4197731e-03 -3.6929462e-03 ...  1.9717988e-02
     5.6085708e-03 -1.5628221e-04]
   [ 1.1211790e-03  2.6711330e-03 -1.9482106e-03 ...  2.7553817e-02
     2.2175895e-02 -1.0396730e-03]
   ...
   [ 3.3056617e-04  4.6207048e-03  2.2537552e-03 ... -1.4263104e-02
    -6.7468719e-03 -1.0978156e-03]
   [ 4.8008582e-04  3.7944347e-03  2.9231098e-03 ... -1.3265806e-02
     5.2854721e-03 -2.5849973e-03]
   [-1.1614685e+00  1.9571548e-02  9.1706263e-03 ... -3.7664244e-01
    -3.9552280e-01 -3.7509048e-01]]]]
```
In practice, those values are the expected output.

4. Now run the following block, which shows the results for DML (1.12) inference:
```
import numpy as np
import onnxruntime as ort

# DirectML execution provider, using the DirectML.dll bundled with the package (1.12).
session = ort.InferenceSession("model3.onnx", providers=["DmlExecutionProvider"])

# Same all-ones inputs as in the CPU run.
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)

outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```
This will show the following output:
```
[[[[ 1.2931919   1.2133068   0.92959875 ...  0.7315138   0.790178
     0.7206028 ]
   [ 0.72345215  0.75116944  0.73152304 ...  0.22733253  0.23081687
     0.25107992]
   [ 0.2179307   0.2573861   0.21388349 ...  0.39436132  0.4016724
     0.40784192]
   ...
   [ 0.08995652  0.19412872  0.17644493 ... -0.49595234 -0.5870603
    -0.6045674 ]
   [-0.7070265  -0.48113438 -0.59548837 ... -0.2593547  -0.13349809
    -0.4097974 ]
   [-0.30988705 -0.24486321 -0.37750056 ... -0.09476887 -0.10245383
     1.6842717 ]]

  [[ 1.908289    1.825213    1.5819263  ...  0.5750079   0.5997568
     0.5894706 ]
   [ 0.5852979   0.5780628   0.59584457 ...  0.08814019  0.10299458
     0.10440087]
   [ 0.11373571  0.09142485  0.08090715 ... -0.07507971 -0.05238473
    -0.07834692]
   ...
   [ 0.52536386  0.3986077   0.40048116 ... -2.4888163  -2.5107079
    -2.5601194 ]
   [-2.683926   -2.500065   -2.5673628  ... -2.6227055  -2.2424307
    -3.2082942 ]
   [-3.0153034  -2.7083983  -3.0444543  ... -2.6381683  -2.937557
    -0.8014389 ]]]]
```
Notice that the results are completely different (and in practice, no usable value is produced).

5. Exit Python, open the `Lib\site-packages\onnxruntime\capi` subfolder of your Python installation, and rename `DirectML.dll` to `DirectML.bak`.
6. [Download DirectML 1.10.1](https://www.nuget.org/api/v2/package/Microsoft.AI.DirectML/1.10.1) (the latest known version without the regression) and extract `DirectML.dll`, for instance with 7-Zip or NanaZip (or [open DirectML 1.10.1 in NuGet Package Explorer](https://nuget.info/packages/Microsoft.AI.DirectML/1.10.1) and download `DirectML.dll` from there), then place it next to `DirectML.bak`.
7. Launch Python again and rerun the DML code:
```
import numpy as np
import onnxruntime as ort

# DirectML execution provider again, now loading the replaced DirectML.dll (1.10.1).
session = ort.InferenceSession("model3.onnx", providers=["DmlExecutionProvider"])

# Same all-ones inputs as in the previous runs.
input0 = np.ones((1, 1, 257, 200), dtype=np.float32)
input1 = np.ones((1, 1, 257, 200), dtype=np.float32)
input2 = np.ones((1, 1, 257, 200), dtype=np.float32)

outputs = session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])
```
This will show the following output:
```
[[[[ 2.96219379e-01  4.40306246e-01  4.96606797e-01 ...  4.63159323e-01
     4.70169514e-01  4.83349711e-01]
   [ 1.36600077e-01  3.46489936e-01  4.52565104e-01 ...  1.96821779e-01
     1.75700665e-01  1.50690824e-01]
   [ 8.65869373e-02  2.35554934e-01  2.55067706e-01 ...  1.89643666e-01
     1.68269396e-01  1.86276674e-01]
   ...
   [ 2.06858933e-01  1.79518670e-01  1.22561574e-01 ...  4.04009968e-01
     2.99959868e-01  2.32160479e-01]
   [ 2.10545927e-01  2.87681311e-01  1.55097321e-01 ...  2.59109288e-01
     1.26482189e-01  1.55689940e-01]
   [ 3.51458400e-01  5.47263682e-01  4.92354274e-01 ...  5.09262800e-01
     1.08978793e-01  8.29321682e-01]]

  [[ 2.20179441e-03  2.06306414e-03 -5.22566261e-04 ...  2.24243640e-03
    -1.67239050e-05 -1.73229771e-03]
   [ 1.72400963e-03  1.42002921e-03 -3.69322696e-03 ...  1.97228286e-02
     5.61172329e-03 -1.56276918e-04]
   [ 1.12099492e-03  2.67143198e-03 -1.94829446e-03 ...  2.75527705e-02
     2.21788641e-02 -1.04018056e-03]
   ...
   [ 3.30423441e-04  4.62117651e-03  2.25326000e-03 ... -1.42651405e-02
    -6.74304273e-03 -1.09608530e-03]
   [ 4.79940762e-04  3.79496464e-03  2.92291958e-03 ...  9.46796732e-04
     1.33081703e-02 -2.03865883e-03]
   [-1.16166353e+00  1.95743497e-02  9.17264353e-03 ... -3.78102601e-01
    -3.96833807e-01 -3.75986814e-01]]]]
```
Notice that the results are very close, almost identical, to the CPU output, and in practice the values correspond to what is expected.
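Comparing the printed arrays by eye only goes so far, so here is a minimal sketch of how the CPU and DML outputs could be compared numerically. The helper names `compare_outputs` and `run_model` and the tolerance values are my own assumptions for illustration; they are not part of ONNX Runtime, and the tolerances are arbitrary starting points rather than an accepted threshold.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-4):
    """Summarize how far `candidate` is from `reference`.

    rtol/atol are illustrative defaults (an assumption), not official limits.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    abs_diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        # np.allclose treats its second argument as the reference for rtol.
        "allclose": bool(np.allclose(candidate, reference, rtol=rtol, atol=atol)),
    }


def run_model(model_path, provider):
    """Run the repro model with all-ones inputs on the given execution provider."""
    import onnxruntime as ort  # imported lazily so compare_outputs works without it

    session = ort.InferenceSession(model_path, providers=[provider])
    ones = np.ones((1, 1, 257, 200), dtype=np.float32)
    feeds = {"noisy_mag.1": ones, "noisy_real.1": ones, "noisy_imag.1": ones}
    return session.run(None, feeds)[0]


# Usage, with model3.onnx in the current directory:
#   cpu = run_model("model3.onnx", "CPUExecutionProvider")
#   dml = run_model("model3.onnx", "DmlExecutionProvider")
#   print(compare_outputs(cpu, dml))
```

Running the usage lines once with each `DirectML.dll` in place gives a single set of numbers per DirectML version, which is easier to track than the truncated array printouts.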

With this model, correct values are produced with DirectML 1.8, 1.9, 1.10, and 1.10.1, and bad values are produced with DirectML 1.11 and 1.12. Note that the final tensor shape is correct; only the values are wrong.
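Testing several DirectML versions means repeating steps 5 and 6 by hand, so the swap could be scripted along these lines. This is only a sketch: `swap_directml_dll` is a hypothetical helper name, and it assumes the replacement `DirectML.dll` has already been extracted from the NuGet package. It has to run in a fresh Python process that has not imported `onnxruntime`, since otherwise Windows keeps the DLL locked.

```python
import shutil
from pathlib import Path


def swap_directml_dll(capi_dir, replacement_dll, backup_name="DirectML.bak"):
    """Back up the bundled DirectML.dll and copy another version in its place.

    capi_dir:        the onnxruntime\\capi folder inside site-packages
    replacement_dll: path to the DirectML.dll extracted from a NuGet package
    Returns the path of the backup file.
    """
    capi_dir = Path(capi_dir)
    current = capi_dir / "DirectML.dll"
    backup = capi_dir / backup_name
    if not current.is_file():
        raise FileNotFoundError(f"no DirectML.dll found in {capi_dir}")
    # Keep the first backup only, so the originally bundled DLL is never lost
    # when the helper is called several times with different versions.
    if not backup.exists():
        shutil.move(str(current), str(backup))
    shutil.copyfile(replacement_dll, current)
    return backup


# Usage (paths are examples; the capi folder can be located without importing
# onnxruntime, which would load and lock the DLL):
#   import importlib.util
#   spec = importlib.util.find_spec("onnxruntime")
#   capi = Path(spec.origin).parent / "capi"
#   swap_directml_dll(capi, r"C:\temp\directml-1.10.1\DirectML.dll")
```

To restore the bundled version, copy `DirectML.bak` back over `DirectML.dll`.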

My DirectML device is a GeForce RTX 3090.
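For anyone trying to reproduce this on other hardware, a short environment dump may help when comparing setups. The sketch below uses a hypothetical helper, `describe_environment`. Note that `ort.__version__` reports the ONNX Runtime package version, not the version of the `DirectML.dll` that actually gets loaded, so the DLL version still has to be noted separately.

```python
import importlib.util
import platform
import sys


def describe_environment():
    """Collect basic facts about the Python / ONNX Runtime setup."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Check for the package first so the helper also works where it is missing.
    if importlib.util.find_spec("onnxruntime") is None:
        info["onnxruntime"] = None
        return info
    import onnxruntime as ort

    info["onnxruntime"] = ort.__version__
    # Should list DmlExecutionProvider when onnxruntime-directml is installed.
    info["providers"] = ort.get_available_providers()
    return info


# Usage:
#   for key, value in describe_environment().items():
#       print(f"{key}: {value}")
```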
