[CUDA] Allow dynamic shmem of size > 48K in runtime#11478

Merged
junrushao merged 4 commits into apache:main from masahi:more-dyn-shmem
May 27, 2022

@masahi masahi commented May 26, 2022

Currently, we have functioning dynamic shared memory support on CUDA, but we haven't actually explored allocating more than 48KB of dynamic shmem.

This PR updates the CUDA runtime to support launching a kernel that uses more than 48KB of dynamic shmem. This is already useful for manually rewritten schedules, but integrating this feature into tuning requires more work (see the discussion on VerifyGPUCode below).

I'll add a test that actually uses a large dyn shmem allocation in the next PR (I need to fix one bug in the software pipelining transform first).

Reference in cutlass code:
https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/device/gemm.h#L479-L482
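As a rough illustration of the opt-in step described above, here is a minimal C++ sketch of the decision logic only. It is not the code from this PR: `PrepareLaunch`, `NeedsDynShmemOptIn`, and `SetMaxDynShmemFn` are hypothetical names, and the callback stands in for a real driver call such as `cuFuncSetAttribute(func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, bytes)` so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>

// Threshold used in this PR: requests at or above 48KB trigger the opt-in.
constexpr std::size_t kDefaultDynShmemLimit = 48 << 10;

// Hypothetical stand-in for the driver call that raises a kernel's maximum
// dynamic shared memory size. Returns true on success.
using SetMaxDynShmemFn = std::function<bool(int /*bytes*/)>;

// Mirrors the `>=` comparison in the PR's hunk.
inline bool NeedsDynShmemOptIn(std::size_t dyn_shmem_size) {
  return dyn_shmem_size >= kDefaultDynShmemLimit;
}

// Raises the per-function limit only when the request needs it.
// Returns false if the (stand-in) driver call reports failure.
inline bool PrepareLaunch(std::size_t dyn_shmem_size,
                          const SetMaxDynShmemFn& set_max_dyn_shmem) {
  if (!NeedsDynShmemOptIn(dyn_shmem_size)) return true;
  // The driver attribute takes an int, so the size must fit in one.
  assert(dyn_shmem_size <=
         static_cast<std::size_t>(std::numeric_limits<int>::max()));
  return set_max_dyn_shmem(static_cast<int>(dyn_shmem_size));
}
```

With this shape, a 16KB request never touches the setter, while a 64KB request calls it once with 65536 before the launch proceeds.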

@vinx13 @junrushao1994 @tqchen @yzh119 @Hzfengsy


if (fcache_[device_id] == nullptr) {
  fcache_[device_id] = m_->GetFunc(device_id, func_name_);
  if (wl.dyn_shmem_size >= (48 << 10)) {

Member

If the dynamic memory is too large, will it pass the VerifyGPUCode check?

Member Author

I haven't tested it, but yes: it seems VerifyGPUCode checks the static alloc size against max_shared_memory_per_block, so the check would fail if dyn_shmem_size >= (48 << 10).

} else if (storage_scope.rank == runtime::StorageRank::kShared) {
  size_t size = static_cast<size_t>(op->ConstantAllocationSize());
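One way the check could distinguish the two cases is sketched below. This is purely illustrative and is not how VerifyGPUCode is implemented: `SharedAllocKind`, `SharedMemLimits`, and `SharedAllocFits` are hypothetical names, and the opt-in maximum is passed in as a parameter because it is device-specific (on CUDA it can be queried through `CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN`).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical classification of a shared-memory allocation.
enum class SharedAllocKind { kStatic, kDynamic };

// Hypothetical limits record; both values are supplied by the caller.
struct SharedMemLimits {
  std::size_t max_per_block;        // default cap, e.g. max_shared_memory_per_block
  std::size_t max_per_block_optin;  // device-specific cap after opt-in
};

// Static allocations are checked against the default cap; dynamic ones are
// allowed up to the opt-in cap, on the assumption that the runtime raises
// the per-function limit before launch.
inline bool SharedAllocFits(SharedAllocKind kind, std::size_t bytes,
                            const SharedMemLimits& limits) {
  assert(limits.max_per_block <= limits.max_per_block_optin);
  const std::size_t cap = (kind == SharedAllocKind::kDynamic)
                              ? limits.max_per_block_optin
                              : limits.max_per_block;
  return bytes <= cap;
}
```

Under these assumptions a 64KB dynamic allocation would pass on a device whose opt-in cap is at least 64KB, while a 64KB static allocation would still be rejected.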

Member Author

Can we defer this issue until later? I need this to demonstrate that a multi-stage pipeline with depth > 2 works on a semi-realistic CUDA schedule.

Member

Yeah, let's defer this particular issue.

@masahi changed the title from "[CUDA] Allow dynamic shmem of size > 48K" to "[CUDA] Allow dynamic shmem of size > 48K in runtime" on May 26, 2022

fcache_[device_id] = m_->GetFunc(device_id, func_name_);
if (wl.dyn_shmem_size >= (48 << 10)) {
  // Assumption: dyn_shmem_size doesn't change across different invocations of
  // fcache_[device_id]

Member Author

This assumption could be controversial, but it should be mostly OK in practice. To support a kernel that uses different large shmem sizes depending on the input, we would need to call cuFuncSetAttribute on every invocation.
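For kernels whose size does vary, one possible middle ground between "set once" and "set on every invocation" is to remember the largest size already configured and call the setter again only when a launch asks for more. The sketch below is an assumption-laden illustration, not part of this PR: `DynShmemOptInTracker` is a hypothetical class, the callback stands in for `cuFuncSetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES`, and it relies on that attribute being a maximum, so smaller launches remain valid after it is raised.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>

// Hypothetical per-function tracker for the configured dynamic shmem maximum.
class DynShmemOptInTracker {
 public:
  // Invokes the setter only when the request is at or above the 48KB
  // threshold and exceeds what was configured before.
  // Returns true if the setter was invoked.
  bool EnsureCapacity(std::size_t dyn_shmem_size,
                      const std::function<void(int)>& set_max_dyn_shmem) {
    if (dyn_shmem_size < kOptInThreshold) return false;     // default cap suffices
    if (dyn_shmem_size <= configured_bytes_) return false;  // already large enough
    assert(dyn_shmem_size <=
           static_cast<std::size_t>(std::numeric_limits<int>::max()));
    set_max_dyn_shmem(static_cast<int>(dyn_shmem_size));
    configured_bytes_ = dyn_shmem_size;
    return true;
  }

 private:
  static constexpr std::size_t kOptInThreshold = 48 << 10;
  std::size_t configured_bytes_ = 0;  // largest size passed to the setter so far
};
```

A sequence of launches requesting 64KB, 64KB, 56KB, then 96KB would invoke the setter twice (for 64KB and 96KB). In a runtime that launches from multiple threads, a tracker like this would also need synchronization, which the sketch omits.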

@junrushao

Member

Thanks @masahi @vinx13 @Hzfengsy, it's merged!
