Inference regression DML 1.10.1->1.11 and higher #483

@divideconcept

I noticed an inference regression between DML 1.10.1 (and earlier) and DML 1.11 (and later), which causes the inference results to be completely off with some models. I'm not sure exactly which node causes the issue, but here's a complete step-by-step repro:

  1. Download model3.onnx
  2. Install ONNX Runtime for Python with DirectML (1.12) support: pip install onnxruntime-directml
  3. Launch Python and run the following block, which shows the results for CPU inference (ground-truth):
import onnxruntime as ort
import numpy as np
session=ort.InferenceSession("model3.onnx", providers=['CPUExecutionProvider'])
input0=np.ones((1,1,257,200),dtype=np.float32)
input1=np.ones((1,1,257,200),dtype=np.float32)
input2=np.ones((1,1,257,200),dtype=np.float32)
outputs=session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])

This will show the following output:

[[[[ 2.9624936e-01  4.4031662e-01  4.9660692e-01 ...  4.6317300e-01
     4.7017282e-01  4.8335472e-01]
   [ 1.3661800e-01  3.4651098e-01  4.5258337e-01 ...  1.9679888e-01
     1.7567760e-01  1.5068966e-01]
   [ 8.6605281e-02  2.3559587e-01  2.5511262e-01 ...  1.8966110e-01
     1.6827986e-01  1.8628305e-01]
   ...
   [ 2.0687398e-01  1.7956746e-01  1.2259285e-01 ...  4.0398946e-01
     2.9999584e-01  2.3229304e-01]
   [ 2.1055967e-01  2.8771651e-01  1.5513927e-01 ...  2.2960059e-01
     9.8949686e-02  1.3984089e-01]
   [ 3.5148939e-01  5.4730177e-01  4.9234924e-01 ...  5.0844795e-01
     1.0927881e-01  8.2973397e-01]]

  [[ 2.2016920e-03  2.0630960e-03 -5.2188965e-04 ...  2.2417619e-03
    -1.7067563e-05 -1.7323773e-03]
   [ 1.7242560e-03  1.4197731e-03 -3.6929462e-03 ...  1.9717988e-02
     5.6085708e-03 -1.5628221e-04]
   [ 1.1211790e-03  2.6711330e-03 -1.9482106e-03 ...  2.7553817e-02
     2.2175895e-02 -1.0396730e-03]
   ...
   [ 3.3056617e-04  4.6207048e-03  2.2537552e-03 ... -1.4263104e-02
    -6.7468719e-03 -1.0978156e-03]
   [ 4.8008582e-04  3.7944347e-03  2.9231098e-03 ... -1.3265806e-02
     5.2854721e-03 -2.5849973e-03]
   [-1.1614685e+00  1.9571548e-02  9.1706263e-03 ... -3.7664244e-01
    -3.9552280e-01 -3.7509048e-01]]]]

In practice, those values are the expected output.

  4. Now run the following block, which shows the results for DML (1.12) inference:
import onnxruntime as ort
import numpy as np
session=ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0=np.ones((1,1,257,200),dtype=np.float32)
input1=np.ones((1,1,257,200),dtype=np.float32)
input2=np.ones((1,1,257,200),dtype=np.float32)
outputs=session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])

This will show the following output:

[[[[ 1.2931919   1.2133068   0.92959875 ...  0.7315138   0.790178
     0.7206028 ]
   [ 0.72345215  0.75116944  0.73152304 ...  0.22733253  0.23081687
     0.25107992]
   [ 0.2179307   0.2573861   0.21388349 ...  0.39436132  0.4016724
     0.40784192]
   ...
   [ 0.08995652  0.19412872  0.17644493 ... -0.49595234 -0.5870603
    -0.6045674 ]
   [-0.7070265  -0.48113438 -0.59548837 ... -0.2593547  -0.13349809
    -0.4097974 ]
   [-0.30988705 -0.24486321 -0.37750056 ... -0.09476887 -0.10245383
     1.6842717 ]]

  [[ 1.908289    1.825213    1.5819263  ...  0.5750079   0.5997568
     0.5894706 ]
   [ 0.5852979   0.5780628   0.59584457 ...  0.08814019  0.10299458
     0.10440087]
   [ 0.11373571  0.09142485  0.08090715 ... -0.07507971 -0.05238473
    -0.07834692]
   ...
   [ 0.52536386  0.3986077   0.40048116 ... -2.4888163  -2.5107079
    -2.5601194 ]
   [-2.683926   -2.500065   -2.5673628  ... -2.6227055  -2.2424307
    -3.2082942 ]
   [-3.0153034  -2.7083983  -3.0444543  ... -2.6381683  -2.937557
    -0.8014389 ]]]]

Notice that the results are completely different (and in practice, no usable value is produced).

  5. Exit Python, open the Lib\site-packages\onnxruntime\capi subfolder of your Python folder, and rename DirectML.dll to DirectML.bak.
  6. Download DirectML 1.10.1 (the latest known version without the regression) and extract DirectML.dll using 7zip or NanaZip, for instance (or open DirectML 1.10.1 in NuGet Package Explorer and download DirectML.dll), then place it next to DirectML.bak.
  7. Launch Python again and rerun the DML code:
import onnxruntime as ort
import numpy as np
session=ort.InferenceSession("model3.onnx", providers=['DmlExecutionProvider'])
input0=np.ones((1,1,257,200),dtype=np.float32)
input1=np.ones((1,1,257,200),dtype=np.float32)
input2=np.ones((1,1,257,200),dtype=np.float32)
outputs=session.run(None, {"noisy_mag.1": input0, "noisy_real.1": input1, "noisy_imag.1": input2})
print(outputs[0])

This will show the following output:

[[[[ 2.96219379e-01  4.40306246e-01  4.96606797e-01 ...  4.63159323e-01
     4.70169514e-01  4.83349711e-01]
   [ 1.36600077e-01  3.46489936e-01  4.52565104e-01 ...  1.96821779e-01
     1.75700665e-01  1.50690824e-01]
   [ 8.65869373e-02  2.35554934e-01  2.55067706e-01 ...  1.89643666e-01
     1.68269396e-01  1.86276674e-01]
   ...
   [ 2.06858933e-01  1.79518670e-01  1.22561574e-01 ...  4.04009968e-01
     2.99959868e-01  2.32160479e-01]
   [ 2.10545927e-01  2.87681311e-01  1.55097321e-01 ...  2.59109288e-01
     1.26482189e-01  1.55689940e-01]
   [ 3.51458400e-01  5.47263682e-01  4.92354274e-01 ...  5.09262800e-01
     1.08978793e-01  8.29321682e-01]]

  [[ 2.20179441e-03  2.06306414e-03 -5.22566261e-04 ...  2.24243640e-03
    -1.67239050e-05 -1.73229771e-03]
   [ 1.72400963e-03  1.42002921e-03 -3.69322696e-03 ...  1.97228286e-02
     5.61172329e-03 -1.56276918e-04]
   [ 1.12099492e-03  2.67143198e-03 -1.94829446e-03 ...  2.75527705e-02
     2.21788641e-02 -1.04018056e-03]
   ...
   [ 3.30423441e-04  4.62117651e-03  2.25326000e-03 ... -1.42651405e-02
    -6.74304273e-03 -1.09608530e-03]
   [ 4.79940762e-04  3.79496464e-03  2.92291958e-03 ...  9.46796732e-04
     1.33081703e-02 -2.03865883e-03]
   [-1.16166353e+00  1.95743497e-02  9.17264353e-03 ... -3.78102601e-01
    -3.96833807e-01 -3.75986814e-01]]]]

Notice that the results are very close, almost identical to the CPU output. In practice, the values correspond to what is expected.
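To put a number on "very close" instead of comparing printed tensors by eye, the outputs can be diffed numerically. The following is only a sketch: `compare_outputs` and `run_model` are hypothetical helper names, and the `rtol`/`atol` defaults are assumptions that may need tuning for this model. It reports summary statistics rather than a pass/fail verdict.

```python
import numpy as np


def compare_outputs(reference, candidate, rtol=1e-3, atol=1e-3):
    """Summarize how far `candidate` deviates from `reference`.

    The tolerance defaults are assumptions, not values from the issue.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    abs_diff = np.abs(reference - candidate)
    # Element-wise criterion: |candidate - reference| <= atol + rtol * |reference|
    outside = abs_diff > (atol + rtol * np.abs(reference))
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "fraction_outside_tol": float(outside.mean()),
    }


def run_model(model_path, provider):
    """Run the repro inputs (all ones) through one execution provider."""
    # Imported lazily so compare_outputs can be used without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=[provider])
    ones = np.ones((1, 1, 257, 200), dtype=np.float32)
    feeds = {"noisy_mag.1": ones, "noisy_real.1": ones, "noisy_imag.1": ones}
    return session.run(None, feeds)[0]


# Usage (requires model3.onnx and onnxruntime-directml):
#   cpu = run_model("model3.onnx", "CPUExecutionProvider")
#   dml = run_model("model3.onnx", "DmlExecutionProvider")
#   print(compare_outputs(cpu, dml))
```

Running the comparison once per installed DirectML.dll gives one set of numbers per version, which is easier to track than the raw dumps.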

With this model, correct values are produced with DirectML 1.8, 1.9, 1.10 and 1.10.1, and bad values are produced with DirectML 1.11 and 1.12. Note that the final tensor shape is correct; only the values are wrong.
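When trying several DirectML versions in turn, the manual DLL swap from the repro steps (renaming DirectML.dll to DirectML.bak and dropping in another build) can be scripted. This is a minimal sketch using only the Python standard library. `swap_directml_dll` is a hypothetical helper, and the paths in the usage comment are placeholders to adapt. It should be run from a fresh interpreter that has not imported onnxruntime, matching the "Exit Python" step.

```python
import shutil
from pathlib import Path


def swap_directml_dll(capi_dir, replacement_dll, backup_name="DirectML.bak"):
    """Replace DirectML.dll in `capi_dir` with `replacement_dll`.

    The DLL found on the first call is preserved as `backup_name`;
    later calls leave that backup untouched.
    """
    capi_dir = Path(capi_dir)
    current = capi_dir / "DirectML.dll"
    backup = capi_dir / backup_name
    if not current.is_file():
        raise FileNotFoundError(f"{current} not found")
    if not backup.exists():
        # First swap: keep the shipped DLL so it can be restored later.
        current.rename(backup)
    else:
        # A backup already exists: discard the previously swapped-in DLL.
        current.unlink()
    shutil.copy2(replacement_dll, current)
    return backup


# Usage (placeholder paths, adjust to your installation):
#   swap_directml_dll(r"<python folder>\Lib\site-packages\onnxruntime\capi",
#                     r"<extracted 1.10.1 package>\DirectML.dll")
```

Restoring the shipped version amounts to copying DirectML.bak back over DirectML.dll.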

My DirectML device is a GeForce RTX 3090.
