Add epilogue functor for residual block fusion #391

Merged
hwu36 merged 4 commits into NVIDIA:master from masahi:epilouge-residual-block on Dec 30, 2021

masahi commented Dec 23, 2021

Contributor

Following the discussion in #347, this is my proposal for enabling full fusion of a residual block in deep neural networks, of the form UnaryOp(BinaryOp(ActivationOp(TensorOp(X) + bias), residual)), where residual has the same shape as the output tensor.

As discussed in #347, the LinearCombinationBiasElementwise epilogue already supports elementwise((alpha x conv + beta x C) + per_channel_bias). What we need in deep learning, however, is elementwise((conv + per_channel_bias) + C). Initially I extended LinearCombinationBiasElementwise with another template parameter to specify the order of the two additions, but I decided to implement a new epilogue functor from scratch for the following reasons:

  • We need an ActivationOp template parameter to express the activation function applied after the broadcast addition of the per-channel bias
  • We don't need the T output tensor (we want to set kStoreT = false)
  • Fewer template parameters and more descriptive variable names than C, V, etc.
  • Use vectorization where possible
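The difference between the two orderings can be seen in a scalar sketch. This is not the CUTLASS functor itself: the function names, the function-pointer parameters, and the choice of plus for BinaryOp are assumptions made for illustration, and a single float stands in for one output element.

```cpp
#include <algorithm>

// Hypothetical scalar stand-ins for ActivationOp / UnaryOp.
inline float relu_op(float x) { return std::max(x, 0.0f); }
inline float identity_op(float x) { return x; }

// Existing form: elementwise((alpha * conv + beta * C) + per_channel_bias).
// The source tensor C is added before the bias, inside a single elementwise op.
inline float linear_combination_bias_elementwise(float conv, float c, float bias,
                                                 float alpha, float beta,
                                                 float (*elementwise)(float)) {
  return elementwise((alpha * conv + beta * c) + bias);
}

// Proposed form: UnaryOp(BinaryOp(ActivationOp(conv + bias), residual)),
// with BinaryOp assumed to be plus. The activation sees only conv + bias;
// the residual is added afterwards.
inline float residual_block_epilogue(float conv, float bias, float residual,
                                     float (*activation)(float),
                                     float (*unary)(float)) {
  return unary(activation(conv + bias) + residual);
}
```

For example, with conv = -3, bias = 1 and residual (or C) = 5, the proposed form with ReLU for both ops yields 5, while the existing form with alpha = beta = 1 and a ReLU elementwise op yields 3, because the inner activation clamps conv + bias before the residual is added.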

TODO: Currently one of the two new tests added in conv2d_fprop_with_broadcast_sm75.cu fails on split_k_slice > 1 cases. Any hints on how to debug this? I don't understand the logic in:

// Only the final block uses Vector
((params.split_k_mode == SplitKMode::kSerial && params.grid_tiled_shape.k() > 1) &&
 (params.grid_tiled_shape.k() != threadblock_tile_idx.k() + 1))
    ? nullptr
    : ptr_Vector,

masahi force-pushed the epilouge-residual-block branch from 658e1d3 to 5603a93 on December 23, 2021 14:47

hwu36 commented Dec 24, 2021

Collaborator

To answer your split-k question, suppose you have two threadblocks, tb0 and tb1, processing the channels in the same n/p/q. tb0 processes the first half of the channels and writes its result to global memory. tb1 processes the second half of the channels, adds tb0's result to its own, and finally writes the final result to global memory.

tb0 and tb1 use a semaphore to maintain the execution order here: tb0 goes first, tb1 second. tb0 and tb1 also share exactly the same output tensor space. tb0 writes its partial result to this space first; tb1 then reads this partial result and eventually writes the final result to it.

In the case of the original broadcast fused conv, when tb0 reaches this line, skip_elementwise_ will be set to true here. tb0's bias vector pointer is set to nullptr here, which means tb0 skips the bias vector. tb0 then just computes alpha x first_half_of_conv + beta x C here and eventually outputs it to global memory.

As for tb1, its C tensor is hard-coded to be tb0's output here, and tb1's beta is hard-coded to 1 here. So z = elementwise(plus(alpha_ x tmp_Accum[i] + beta_ x tmp_C[i], V[i])) = elementwise(plus(alpha x second_half_of_conv + tb0_partial_result, V[i])), which is what we want in the end.
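The two-block flow described above can be modeled for a single output element. This is a hedged scalar sketch, not the CUTLASS kernel: the function names are hypothetical, ReLU is assumed as the elementwise op, and the semaphore ordering is represented simply by calling the first block before the last.

```cpp
#include <algorithm>

// Assumed elementwise op for this sketch.
inline float elementwise_relu(float x) { return std::max(x, 0.0f); }

// tb0: the bias pointer is null and the elementwise op is skipped, so only
// alpha * first_half_of_conv + beta * C is stored to the shared output space.
inline float first_block(float first_half, float c, float alpha, float beta) {
  return alpha * first_half + beta * c;
}

// tb1: C is replaced by tb0's partial result and beta is forced to 1; the
// bias is added and the elementwise op is applied only here.
inline float last_block(float second_half, float partial, float bias, float alpha) {
  const float beta = 1.0f;
  return elementwise_relu((alpha * second_half + beta * partial) + bias);
}

// Single-pass reference: elementwise((alpha * conv + beta * C) + bias).
inline float unsplit_reference(float conv, float c, float bias, float alpha, float beta) {
  return elementwise_relu((alpha * conv + beta * c) + bias);
}
```

With alpha = 2, beta = 0.5, C = 4, bias = 1 and halves 2 and 3, tb0 stores 6 and tb1 produces 13, matching the single-pass reference on conv = 5. The match relies on the elementwise op being applied exactly once, after the whole linear combination has been accumulated.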


masahi commented Dec 24, 2021

Contributor Author

@hwu36 Thank you very much, it is very interesting! The way split-k is done in conv2d was mysterious to me, since I always thought split-k requires two kernel launches but I didn't find the second one.

Now I understand where the bug is in my code: I also have to skip the activation in UnaryOp(BinaryOp(ActivationOp(TensorOp(X) + bias), residual)) when skip_elemwise = true. One of the two new tests is passing because its ActivationOp is identity :)

I'll update the PR next week.

masahi force-pushed the epilouge-residual-block branch from 5603a93 to 7b7b005 on December 27, 2021 07:24
masahi marked this pull request as ready for review on December 27, 2021 07:24

masahi commented Dec 27, 2021

Contributor Author

@hwu36 I realized that the need to apply an activation op before BinaryOp in BinaryOp(ActivationOp(TensorOp(X) + bias), residual) makes it impossible to support split-k unless ActivationOp is Identity. tb0 computes alpha x first_half_of_conv + beta x C, but the activation should not be applied to the beta x C part.

So I want to drop support for split-k when using this new epilogue functor, unless the activation op is Identity. Otherwise I would have to change a lot of interfaces to plumb all the required arguments through to the last block, where the activation and elementwise ops are applied. Let me know if that's acceptable.
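A small numeric model illustrates the problem. It is a simplified sketch under stated assumptions, not the actual kernel code path: alpha = beta = 1, BinaryOp is plus, UnaryOp is ReLU, the residual travels through the C operand so that the first block folds it into its partial sum, and all function names are hypothetical.

```cpp
#include <algorithm>

inline float relu_act(float x) { return std::max(x, 0.0f); }
inline float identity_act(float x) { return x; }

// Intended result: UnaryOp(ActivationOp(conv + bias) + residual), UnaryOp = ReLU.
inline float fused_reference(float conv, float bias, float residual,
                             float (*activation)(float)) {
  return relu_act(activation(conv + bias) + residual);
}

// Serial split-K model: tb0 stores first_half + residual (its C term), and tb1
// sees only that combined partial sum, so the activation can no longer be
// applied to conv + bias alone.
inline float serial_split_k_model(float first_half, float second_half, float bias,
                                  float residual, float (*activation)(float)) {
  const float partial = first_half + residual;                           // written by tb0
  return relu_act(activation(second_half + partial + bias));             // computed by tb1
}
```

With halves 1 and -4, bias = 1 and residual = 5, an Identity activation gives 3 in both functions, whereas a ReLU activation gives 5 for the reference and 3 for the split-K model, since the residual has already been mixed into the value the activation receives.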

This is ready for review.


hwu36 commented Dec 28, 2021

Collaborator

> Let me know if that's acceptable.

It is okay. Please add a comment in the beginning of the testbed to explain the reason.

I will take a look at your code tomorrow.

masahi force-pushed the epilouge-residual-block branch from b202897 to a53d14f on December 28, 2021 10:55
Comment thread test/unit/conv/device/conv2d_with_broadcast_testbed.h Outdated
masahi force-pushed the epilouge-residual-block branch from 1adc0b8 to a5af061 on December 29, 2021 12:12
hwu36 merged commit c2ee13a into NVIDIA:master on Dec 30, 2021
sanchitintel pushed a commit to sanchitintel/cutlass that referenced this pull request Jun 24, 2025
This PR updates the intel graphics drivers for PVC

---------

Co-authored-by: carlewis <carlewis@gmail.com>
Albresky pushed a commit to Albresky/cutlass that referenced this pull request Oct 11, 2025
* Add epilogue functor for residual block fusion

* Do not run split-k tests when ActivationOp is not Identity

* explain TestSplitK param

* return early
2 participants