Add epilogue functor for residual block fusion #391
To answer your split-k question: suppose you have two threadblocks, tb0 and tb1, processing the channels for the same n/p/q. tb0 processes the first half of the channels and writes its result to global memory; tb1 processes the second half of the channels, adds tb0's result, and writes the final result to global memory last. tb0 and tb1 use a semaphore to maintain the execution order: tb0 goes first, tb1 second. They also share exactly the same output tensor space: tb0 writes its partial result to this space first, then tb1 reads that partial result and eventually writes the final result to it. In the case of the original broadcast fused conv, when tb0 reaches this line, As to tb1, its
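The ordering described above can be modelled on the host. The following is only a sequential sketch, not the CUTLASS kernel: `SerialSplitK`, `run_slice` and `split_k_dot` are hypothetical names, the "semaphore" is a plain integer checked with an assertion rather than a device-side lock, and the reduction is a simple dot product standing in for the channel reduction of a convolution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of serial split-k: every slice reduces its own part of the
// K (channel) range, and the slices take turns on one shared output buffer.
// `semaphore` holds the index of the slice that is allowed to write next.
struct SerialSplitK {
    int semaphore = 0;          // hypothetical stand-in for the device semaphore
    std::vector<float> output;  // shared output space, one value per element

    explicit SerialSplitK(std::size_t n) : output(n, 0.0f) {}

    // partial[i] is this slice's partial accumulation for output element i.
    void run_slice(int slice, const std::vector<float>& partial) {
        assert(semaphore == slice);  // models waiting for this slice's turn
        for (std::size_t i = 0; i < output.size(); ++i) {
            // Slice 0 writes its partial result; later slices read it and add.
            output[i] = (slice == 0) ? partial[i] : output[i] + partial[i];
        }
        semaphore = slice + 1;  // release to the next slice
    }
};

// Reduce a[k] * b[k] over k in `slices` equal chunks.
// Assumes `slices` divides a.size() evenly (illustration only).
inline float split_k_dot(const std::vector<float>& a,
                         const std::vector<float>& b, int slices) {
    SerialSplitK state(1);
    const std::size_t chunk = a.size() / static_cast<std::size_t>(slices);
    for (int s = 0; s < slices; ++s) {
        const std::size_t begin = static_cast<std::size_t>(s) * chunk;
        float partial = 0.0f;
        for (std::size_t k = begin; k < begin + chunk; ++k) {
            partial += a[k] * b[k];
        }
        state.run_slice(s, {partial});
    }
    return state.output[0];
}
```

With two slices, `split_k_dot` plays the roles of tb0 and tb1: the first call to `run_slice` stores a partial sum, the second reads it back and overwrites it with the final value, all within a single "launch".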
@hwu36 Thank you very much, this is very interesting! The way split-k is done in conv2d was mysterious to me: I always thought split-k required two kernel launches, but I couldn't find the second one. Now I understand where the bug is in my code: I have to skip the activation in I'll update the PR next week.
@hwu36 I realized that the need to apply an activation op before So I want to drop support for split-k when using this new epilogue functor, unless the activation op is This is ready for review.
That is okay. Please add a comment at the beginning of the testbed to explain the reason. I will take a look at your code tomorrow.
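One way to see why split-k and a non-identity activation do not mix is a scalar model in which the activation runs on each slice's partial sum instead of on the full reduction. This is an illustrative assumption about the failure mode, not a trace of the actual kernel; `reduce_then_activate` and `activate_each_slice` are hypothetical helpers, and adding the bias only in slice 0 is a choice made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

inline float relu(float x) { return std::max(x, 0.0f); }
inline float identity(float x) { return x; }

// Reference: reduce all partial sums first, add bias, then activate once.
template <typename Act>
float reduce_then_activate(const std::vector<float>& partials, float bias,
                           Act act) {
    float sum = 0.0f;
    for (float p : partials) sum += p;
    return act(sum + bias);
}

// Model of the problem: every slice applies the activation to its own
// partial result before the results are added together.
// Assumption for illustration: only slice 0 adds the bias.
template <typename Act>
float activate_each_slice(const std::vector<float>& partials, float bias,
                          Act act) {
    float out = 0.0f;
    for (std::size_t s = 0; s < partials.size(); ++s) {
        const float v = partials[s] + (s == 0 ? bias : 0.0f);
        out += act(v);
    }
    return out;
}
```

For partial sums -3 and 5 with zero bias, the reference gives relu(2) = 2, while activating each slice gives relu(-3) + relu(5) = 5. With the identity activation both paths agree, which matches the decision to run the split-k tests only when the activation op is Identity.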
This PR updates the intel graphics drivers for PVC --------- Co-authored-by: carlewis <carlewis@gmail.com>
* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
Following the discussion in #347, this is my proposal for enabling full fusion of residual blocks in deep neural networks, of the form
UnaryOp(BinaryOp(ActivationOp(TensorOp(X) + bias), residual))
where residual has the same shape as the output tensor.
As discussed in #347, the LinearCombinationBiasElementwise epilogue already supports elementwise((alpha x conv + beta x C) + per_channel_bias). What we need in deep learning, however, is elementwise((conv + per_channel_bias) + C). Initially I extended LinearCombinationBiasElementwise with another template parameter to specify the order in which to do the two additions, but I decided to add a new implementation of an epilogue functor from scratch for the following reasons:
* ActivationOp template parameter to express the activation function to use after broadcast addition of per-channel bias
* T output tensor (want to set kStoreT = false)
* C, V etc
TODO: Currently one of the two new tests added in conv2d_fprop_with_broadcast_sm75.cu fails on split_k_slice > 1 cases. Any hint on how to debug? I don't understand the logic in cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h
Lines 451 to 455 in 808c253
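Separately from the kernel lines referenced above, the two epilogue forms contrasted in this description can be written as scalar functors. This is a minimal sketch that assumes one accumulator value per call; `ResidualBlockEpilogue`, `LinearCombinationBiasModel`, `Identity` and `ReLU` are hypothetical names for illustration and are not the CUTLASS class interfaces, which operate on fragments and carry more template parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>

struct Identity { float operator()(float x) const { return x; } };
struct ReLU { float operator()(float x) const { return std::max(x, 0.0f); } };

// Scalar model of the proposed form:
//   UnaryOp(BinaryOp(ActivationOp(accum + bias), residual))
template <typename ActivationOp, typename BinaryOp, typename UnaryOp>
struct ResidualBlockEpilogue {
    float operator()(float accum, float bias, float residual) const {
        const float z = ActivationOp()(accum + bias);  // activation after bias
        const float t = BinaryOp()(z, residual);       // combine with residual
        return UnaryOp()(t);                           // final elementwise op
    }
};

// Scalar model of the existing form:
//   elementwise((alpha * accum + beta * c) + bias)
template <typename ElementwiseOp>
struct LinearCombinationBiasModel {
    float alpha;
    float beta;
    float operator()(float accum, float c, float bias) const {
        return ElementwiseOp()((alpha * accum + beta * c) + bias);
    }
};
```

A typical residual block would instantiate the first functor as `ResidualBlockEpilogue<Identity, std::plus<float>, ReLU>`, that is, add the bias, add the residual, then apply ReLU. Choosing `ReLU` as the `ActivationOp` instead places a nonlinearity before the residual addition, which the existing form has no slot for.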