[ONNX] Collect quant params of pre-quantized ONNX and generate qnn op #7937
huochaitiantang wants to merge 1 commit into
This is awesome, thanks for the submission! It looks like the ONNX and ORT versions we have in CI might be too out of date to support this? Could you point me to the documentation for this behavior? I wasn't aware ONNX did this; I thought they primarily used ops like ConvInteger. I'm hoping to add support for importing those ops in the coming weeks. Is there a more recent version of ORT that supports this? If so, could you add the test line and comment it out with a TODO for re-enabling once we update to some minimum ORT version? I'm hoping to update the ORT and ONNX versions in CI in the next month or two. Thanks!
@mbrookhart Thanks for your comment!
It shows that the above operators cannot accept int8 tensors in ORT, so these operators, located between QuantizeLinear and DequantizeLinear, cannot run in ORT even though they should be quantized. My question is: even if ORT cannot run these models successfully, should TVM support importing them and generating the correct qnn ops? The answer determines whether this PR is necessary. In addition, ORT quantization operators like
Hi @huochaitiantang, that all makes sense. I'm really just wondering if we've ever seen an example of this in the wild, i.e., q->conv->dq instead of q->convinteger->dq. Do you have an example model defined this way somewhere? I mostly just want to make sure we don't implement a feature that's out of spec. Thanks,
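Both patterns are meant to express the same arithmetic. As a minimal sketch (pure Python, with a 1-D dot product standing in for Conv/Gemm and arbitrarily chosen per-tensor int8 scales and zero points; all function names are hypothetical), the float result of dequantizing the inputs and computing in float equals accumulating the zero-point-shifted integers and applying the combined scale once. The second form is what a ConvInteger-style integer op (which emits the integer accumulator) followed by rescaling, or a TVM qnn op, computes.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Per-tensor int8 quantization in the style of ONNX QuantizeLinear."""
    # Python's round() rounds half to even, matching QuantizeLinear's rounding.
    q = round(x / scale) + zero_point
    # Saturate to the int8 range.
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Inverse mapping in the style of ONNX DequantizeLinear."""
    return (q - zero_point) * scale


def float_dot(qa, sa, za, qw, sw, zw):
    """'q -> op -> dq' view: dequantize the int8 inputs, then compute in float."""
    return sum(dequantize(a, sa, za) * dequantize(w, sw, zw) for a, w in zip(qa, qw))


def integer_dot(qa, sa, za, qw, sw, zw):
    """Integer-op view: accumulate zero-point-shifted products as integers
    (int32 in a real kernel), then apply the combined scale sa * sw once."""
    acc = sum((a - za) * (w - zw) for a, w in zip(qa, qw))
    return acc * sa * sw


qa = [quantize(x, 0.02, 3) for x in [0.5, -1.2, 2.0]]
qw = [quantize(x, 0.01, 0) for x in [0.3, 0.7, -0.4]]
print(float_dot(qa, 0.02, 3, qw, 0.01, 0), integer_dot(qa, 0.02, 3, qw, 0.01, 0))
```

The two printed values agree up to floating-point rounding, which is why an importer can lower an op sandwiched between QuantizeLinear and DequantizeLinear to an integer op plus the collected scales and zero points.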
Yeah, I just wanted to echo @mbrookhart: it would be great to include a test that operates on a real quantized model. Ideally you could start from a pre-quantized TF or PyTorch model, export it to ONNX, then import that using the changes in this PR.
Hi @mbrookhart @jwfromm, thanks for your advice. We have tried to export a real pre-quantized ONNX model from popular frameworks, but it seems difficult.
The pattern QuantizeLinear -> Conv -> DequantizeLinear may not appear in pre-quantized ONNX models exported by PyTorch, TFLite, or ONNX Runtime, so we can close this PR.
Sounds good. We found the "fake quantization" in the TFLite exports a couple of weeks ago; I'm currently working on a pass to convert that into QNN after import. I'll close this, but thanks for the QLinearConv PR!
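In a "fake quantization" graph the tensors stay float, but each one passes through a QuantizeLinear -> DequantizeLinear pair that snaps its values onto the int8 grid defined by a scale and zero point. A minimal sketch of that round trip, assuming per-tensor int8 parameters (the `fake_quant` name is hypothetical); it only shows the numeric effect and says nothing about how the TVM pass mentioned above is implemented:

```python
def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    """QuantizeLinear immediately followed by DequantizeLinear (per-tensor int8)."""
    # Quantize: round half to even, shift by the zero point, saturate to int8.
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    # Dequantize: map the integer back to float on the same grid.
    return (q - zero_point) * scale


for x in [0.123, 1.234, -5.0, 100.0]:
    print(x, "->", fake_quant(x, 0.1, 0))
```

In-range values move by at most half a scale step, while values outside `[(qmin - zero_point) * scale, (qmax - zero_point) * scale]` are clipped; the scale and zero point carried by each such pair are what a conversion to QNN would need to recover.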
For the import of pre-quantized ONNX models, nodes (like `Conv`, `Gemm`, `Add`, `Mul`, `Sub`) between `QuantizeLinear` and `DequantizeLinear` should be quantized. This PR is summarized as follows:
1. Collect the quantization params for the nodes to be quantized, i.e., those located between `QuantizeLinear` and `DequantizeLinear`.
2. Generate the corresponding qnn ops (like `qnn.conv2d`, `qnn.dense`, `qnn.add`, `qnn.mul`, `qnn.subtract`) for nodes that can be quantized.
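The two steps can be sketched on a toy graph. This is not the PR's code: the `Node` class and `collect_quant_params` are hypothetical, scale and zero point are stored directly on the QuantizeLinear/DequantizeLinear nodes (real ONNX passes them as extra inputs, usually initializers), and a node is accepted only if every input comes from a QuantizeLinear and every consumer of its outputs is a DequantizeLinear. The op mapping is the one listed in the PR description.

```python
from dataclasses import dataclass
from typing import List, Optional

# ONNX op -> qnn op, as listed in the PR description.
QNN_OP_MAP = {
    "Conv": "qnn.conv2d",
    "Gemm": "qnn.dense",
    "Add": "qnn.add",
    "Mul": "qnn.mul",
    "Sub": "qnn.subtract",
}


@dataclass
class Node:
    # Hypothetical simplified node; scale/zero_point are only set on Q/DQ nodes.
    op_type: str
    inputs: List[str]
    outputs: List[str]
    scale: Optional[float] = None
    zero_point: Optional[int] = None


def collect_quant_params(nodes):
    """Step 1: find Q -> op -> DQ sandwiches and collect their params.
    Step 2 (mapping only): record which qnn op each match would become."""
    producer = {out: n for n in nodes for out in n.outputs}
    consumers = {}
    for n in nodes:
        for name in n.inputs:
            consumers.setdefault(name, []).append(n)

    plans = []
    for n in nodes:
        if n.op_type not in QNN_OP_MAP:
            continue
        # Every input must be produced by a QuantizeLinear node.
        in_q = [producer.get(name) for name in n.inputs]
        if not all(p is not None and p.op_type == "QuantizeLinear" for p in in_q):
            continue
        # Every consumer of the outputs must be a DequantizeLinear node.
        out_dq = [c for o in n.outputs for c in consumers.get(o, [])]
        if not out_dq or not all(c.op_type == "DequantizeLinear" for c in out_dq):
            continue
        plans.append({
            "node": n.op_type,
            "qnn_op": QNN_OP_MAP[n.op_type],
            "input_params": [(p.scale, p.zero_point) for p in in_q],
            # Assumes all DequantizeLinear consumers share the same params.
            "output_params": (out_dq[0].scale, out_dq[0].zero_point),
        })
    return plans


graph = [
    Node("QuantizeLinear", ["x"], ["x_q"], 0.02, 3),
    Node("QuantizeLinear", ["w"], ["w_q"], 0.01, 0),
    Node("Conv", ["x_q", "w_q"], ["y_q"]),
    Node("DequantizeLinear", ["y_q"], ["y"], 0.05, 0),
    Node("Add", ["y", "y"], ["z"]),  # float Add, not between Q and DQ
]
print(collect_quant_params(graph))
```

Only the `Conv` is matched here; the float `Add` after the DequantizeLinear is left alone. Requiring all consumers to be DequantizeLinear is a conservative choice so that an int8 result is never fed to a float op.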