Add epilogue functor for residual block fusion by masahi · Pull Request #391 · NVIDIA/cutlass

masahi · 2021-12-23T14:14:26Z

Following the discussion in #347, this is my proposal for enabling full fusion of residual blocks in deep neural networks, of the form UnaryOp(BinaryOp(ActivationOp(TensorOp(X) + bias), residual)), where residual has the same shape as the output tensor.

As discussed in #347, the LinearCombinationBiasElementwise epilogue already supports elementwise((alpha x conv + beta x C) + per_channel_bias). What we need in deep learning, however, is elementwise((conv + per_channel_bias) + C). Initially I extended LinearCombinationBiasElementwise with another template parameter to specify the order of the two additions, but I decided to implement a new epilogue functor from scratch for the following reasons:

We need ActivationOp template parameter to express the activation function to use after broadcast addition of per-channel bias
We don't need the T output tensor (we want to set kStoreT = false)
Fewer template parameters and more descriptive variable names than C, V, etc.
Use vectorization if possible
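
The difference between the two orderings can be illustrated with a scalar host-side reference. This is only a sketch under stated assumptions, not the CUTLASS functor: the function names `bias_after_residual` and `residual_block_epilogue` are hypothetical, tensors are flattened as [pixels][channels], and ReLU is chosen for both ActivationOp and UnaryOp with plus as BinaryOp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference; names and layout are assumptions for illustration.
inline float relu(float x) { return std::max(x, 0.0f); }

// Existing ordering: elementwise((alpha * conv + beta * C) + per_channel_bias),
// with elementwise = ReLU for this sketch.
std::vector<float> bias_after_residual(const std::vector<float>& conv,
                                       const std::vector<float>& c,
                                       const std::vector<float>& bias,
                                       float alpha, float beta) {
  assert(conv.size() == c.size());
  assert(!bias.empty() && conv.size() % bias.size() == 0);
  std::vector<float> out(conv.size());
  for (std::size_t i = 0; i < conv.size(); ++i) {
    const float b = bias[i % bias.size()];  // broadcast one bias value per channel
    out[i] = relu((alpha * conv[i] + beta * c[i]) + b);
  }
  return out;
}

// Proposed ordering: UnaryOp(BinaryOp(ActivationOp(conv + bias), residual)),
// with ActivationOp = ReLU, BinaryOp = plus, UnaryOp = ReLU for this sketch.
std::vector<float> residual_block_epilogue(const std::vector<float>& conv,
                                           const std::vector<float>& residual,
                                           const std::vector<float>& bias) {
  assert(conv.size() == residual.size());
  assert(!bias.empty() && conv.size() % bias.size() == 0);
  std::vector<float> out(conv.size());
  for (std::size_t i = 0; i < conv.size(); ++i) {
    const float b = bias[i % bias.size()];
    const float activated = relu(conv[i] + b);  // activation sees conv + bias only
    out[i] = relu(activated + residual[i]);     // residual is added after the activation
  }
  return out;
}
```

With conv = {-3, 1}, bias = {1}, and residual = {5, 2}, the first element differs between the two orderings (3 versus 5) because the inner activation clamps conv + bias before the residual is added.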

TODO: Currently one of the two new tests added in conv2d_fprop_with_broadcast_sm75.cu fails on split_k_slice > 1 cases. Any hints on how to debug this? I don't understand the logic in

cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h

Lines 451 to 455 in 808c253

    // Only the final block uses Vector
    ((params.split_k_mode == SplitKMode::kSerial && params.grid_tiled_shape.k() > 1) &&
     (params.grid_tiled_shape.k() != threadblock_tile_idx.k() + 1))
        ? nullptr
        : ptr_Vector,

@hwu36
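
One way to read the quoted condition is as a standalone helper. The sketch below restates it outside the kernel; the function name `vector_ptr_for_slice`, its parameters, and the local `SplitKMode` enum are stand-ins introduced for illustration and are not part of the quoted file.

```cpp
#include <cassert>

// Stand-in enum for this sketch only; the kernel uses its own SplitKMode type.
enum class SplitKMode { kNone, kSerial, kParallel };

// Hypothetical restatement of the quoted condition:
// under serial split-k with more than one slice, every slice except the
// last one receives a null Vector pointer.
template <typename T>
const T* vector_ptr_for_slice(SplitKMode mode, int k_slices, int k_idx,
                              const T* ptr_vector) {
  assert(k_slices >= 1 && k_idx >= 0 && k_idx < k_slices);
  const bool serial_split_k = (mode == SplitKMode::kSerial && k_slices > 1);
  const bool is_final_slice = (k_idx + 1 == k_slices);
  return (serial_split_k && !is_final_slice) ? nullptr : ptr_vector;
}
```

Here `k_slices` plays the role of params.grid_tiled_shape.k() and `k_idx` the role of threadblock_tile_idx.k().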

hwu36 · 2021-12-24T06:17:33Z

To answer your split-k question: suppose you have two threadblocks, tb0 and tb1, processing the channels in the same n/p/q. tb0 processes the first half of the channels and writes its result to global memory. tb1 processes the second half of the channels, adds tb0's result, and writes the final result to global memory.

tb0 and tb1 use a semaphore to maintain the execution order: tb0 runs first, tb1 second. tb0 and tb1 also share exactly the same output tensor space. tb0 writes its partial result to this space first; tb1 then reads this partial result and eventually writes the final result to it.

In the case of the original broadcast fused conv, when tb0 reaches this line, skip_elementwise_ will be set to true here. tb0's bias vector is set to nullptr here, which means it will skip the bias vector. Then tb0 just computes alpha x first_half_of_conv + beta x C here and eventually outputs it to global memory.

As for tb1, its C tensor is hard-coded to be tb0's output here, and tb1's beta is hard-coded to 1 here. So z = elementwise(plus(alpha_ x tmp_Accum[i] + beta_ x tmp_C[i], V[i])) = elementwise(plus(alpha x second_half_of_conv + tb0_partial_result, V[i])), which is what we want eventually.
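
The serial split-k flow described above can be modeled on the host for a single output element. This is a simplified sketch, not kernel code: the function name `serial_split_k_output` is hypothetical, the loop order stands in for the semaphore, and ReLU is assumed as the elementwise op.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of serial split-k for one output element.
// partial_conv[k] is the accumulator produced by slice k (tb0, tb1, ...).
float serial_split_k_output(const std::vector<float>& partial_conv,
                            float alpha, float beta, float c, float bias) {
  assert(!partial_conv.empty());
  const std::size_t k_slices = partial_conv.size();
  float out = 0.0f;  // the output location shared by all slices
  for (std::size_t k = 0; k < k_slices; ++k) {  // sequential order models the semaphore
    const bool first = (k == 0);
    const bool last = (k + 1 == k_slices);
    // Later slices read the previous partial result as their C, with beta fixed to 1.
    const float source = first ? c : out;
    const float b = first ? beta : 1.0f;
    float z = alpha * partial_conv[k] + b * source;
    if (last) {
      // Only the final slice adds the bias vector and applies the elementwise op.
      z = std::max(z + bias, 0.0f);
    }
    out = z;
  }
  return out;
}
```

Under this model, splitting a convolution result of 3 into partial sums {1, 2} gives the same output as the unsplit case, because non-final slices only accumulate a linear combination.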

masahi · 2021-12-24T08:41:33Z

@hwu36 Thank you very much, it is very interesting! The way split-k is done in conv2d was mysterious to me, since I always thought split-k requires two kernel launches but I didn't find the second one.

Now I understand where the bug is in my code: I also have to skip the activation in UnaryOp(BinaryOp(ActivationOp(TensorOp(X) + bias), residual)) when skip_elemwise = true. One of the two new tests is passing because its ActivationOp is identity :)

I'll update the PR next week.

masahi · 2021-12-27T07:31:44Z

@hwu36 I realized that the need to apply an activation op before BinaryOp in BinaryOp(ActivationOp(TensorOp(X) + bias), residual) makes it impossible to support split-k unless ActivationOp is Identity. tb0 computes alpha x first_half_of_conv + beta x C, but the activation should not be applied to the beta x C part.
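
A small numeric model shows the problem. It is a sketch under simplifying assumptions (two slices, alpha = beta = 1, BinaryOp = plus, no outer UnaryOp); the names `fused_reference` and `two_slice_serial_split_k` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

struct Identity { float operator()(float x) const { return x; } };
struct ReLU { float operator()(float x) const { return std::max(x, 0.0f); } };

// Intended result: BinaryOp(ActivationOp(conv + bias), residual) with BinaryOp = plus.
template <typename ActivationOp>
float fused_reference(float conv, float bias, float residual) {
  return ActivationOp{}(conv + bias) + residual;
}

// Simplified model of two-slice serial split-k with alpha = beta = 1:
// tb0 has already folded the residual (C) into its partial sum, so the last
// block can only apply the activation to the combined value.
template <typename ActivationOp>
float two_slice_serial_split_k(float conv_first_half, float conv_second_half,
                               float bias, float residual) {
  const float tb0_partial = conv_first_half + residual;
  return ActivationOp{}(conv_second_half + tb0_partial + bias);
}
```

With partial sums -4 and 1 (conv = -3), bias = 0, and residual = 5, the Identity case agrees (both give 2), while the ReLU case gives 5 for the reference and 2 for the split-k model.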

So I want to drop support for split-k when using this new epilogue functor, unless the activation op is Identity. Otherwise I would have to change a lot of interfaces to plumb all required arguments through to the last block, where the activation and elemwise op are applied. Let me know if that's acceptable.

This is ready for review.

hwu36 · 2021-12-28T05:27:56Z

> Let me know if that's acceptable.

It is okay. Please add a comment in the beginning of the testbed to explain the reason.

I will take a look at your code tomorrow.

* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early

masahi force-pushed the epilouge-residual-block branch from 658e1d3 to 5603a93 on December 23, 2021 14:47

masahi added 2 commits December 27, 2021 16:24

Add epilogue functor for residual block fusion

be9cd6f

Do not run split-k tests when ActivationOp is not Identity

7b7b005

masahi force-pushed the epilouge-residual-block branch from 5603a93 to 7b7b005 on December 27, 2021 07:24

masahi marked this pull request as ready for review December 27, 2021 07:24

explain TestSplitK param

a53d14f

masahi force-pushed the epilouge-residual-block branch from b202897 to a53d14f on December 28, 2021 10:55

hwu36 reviewed Dec 29, 2021

Comment thread test/unit/conv/device/conv2d_with_broadcast_testbed.h Outdated

return early

a5af061

masahi force-pushed the epilouge-residual-block branch from 1adc0b8 to a5af061 on December 29, 2021 12:12

hwu36 approved these changes Dec 30, 2021

hwu36 merged commit c2ee13a into NVIDIA:master Dec 30, 2021

This was referenced Dec 30, 2021

[CUTLASS] Refactor cutlass kernel generation and selection apache/tvm#9800

Merged

[CUTLASS] Residual connection fusion apache/tvm#9820

Merged

sanchitintel pushed a commit to sanchitintel/cutlass that referenced this pull request Jun 24, 2025

Update PVC drivers (NVIDIA#391)

0c04a1f

This PR updates the intel graphics drivers for PVC --------- Co-authored-by: carlewis <carlewis@gmail.com>

hwu36 merged 4 commits into NVIDIA:master from masahi:epilouge-residual-block
