
On GPGPU core sharing and preemption


First, the terminology:
warp: similar to a SIMD thread on a CPU.
threadblock: a collection of warps that is guaranteed to run on the same SM.
SM: the minimum unit of an execution core.
GPU core: a collection of SMs (assumed to be 4 here).
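To make the terms concrete, here is a minimal CUDA sketch (the kernel name and sizes are made up) of how a kernel launch is expressed as a grid of thread blocks, each of which the hardware splits into 32-thread warps:

#include <cuda_runtime.h>

// Each block holds 128 threads, i.e. 4 warps of 32 threads each; all warps
// of one block are guaranteed to land on the same SM/core.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    // threads 0-31 of the block form warp 0, threads 32-63 form warp 1, ...
    if (i < n) y[i] = a * x[i] + y[i];
}

void launch(float a, const float *x, float *y, int n) {
    int threadsPerBlock = 128;                       // 4 warps per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(a, x, y, n);  // grid of thread blocks
}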
Current GPU core designs have all converged on using 4 SMs inside a GPU core. The SMs have independent schedulers and execution units, but they all share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and each thread block is guaranteed to be scheduled onto a single GPU core. Once a thread block is issued to a GPU core, it is up to the core scheduler to decide how to distribute its warps to the SMs. Nowadays, warps from different thread blocks of different kernels can all be scheduled into the same GPU core to achieve maximum utilization, so what the core scheduler does is dispatch as many warps as possible to the SMs (to hide latency). That said, the number of warps that can be scheduled into an SM depends on the resources shared between threads, e.g. register file size, branch convergence stack, etc.
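As a concrete check of this resource limit, here is a small CUDA sketch (the kernel and block size are hypothetical; note that what CUDA calls an "SM" corresponds roughly to this post's "GPU core") that asks the runtime how many blocks of a kernel can be co-resident, given its register and shared-memory footprint:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; its register and shared-memory usage determines
// how many copies of it can be resident at once.
__global__ void compute_kernel(float *out, const float *in, int n) {
    __shared__ float tile[1024];                     // 4 KB of shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

int main() {
    int blockSize = 256;                             // 256 threads = 8 warps per block
    int maxBlocks = 0;
    // Ask how many blocks of this kernel fit per multiprocessor, given its
    // register, shared-memory, and warp-slot usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocks, compute_kernel, blockSize, /*dynamicSMem=*/0);
    printf("resident blocks: %d (= %d warps)\n", maxBlocks, maxBlocks * blockSize / 32);
    return 0;
}

The fewer registers and the less shared memory the kernel uses, the more warps the scheduler can keep resident to hide latency.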
Here is an example: two thread blocks, each with a warp size of 32, are dispatched to the core scheduler. Each cycle, the core scheduler checks how many warps are executing in the SMs; once there is space (register space, convergence stack, and the other state required to run the threads, etc.) and enough room in local shared memory, the warps are dispatched. Inside the SM there is one more scheduler that actively chooses a warp to execute from the active set of warps (this active set is a subset of the larger set of resident warps); each cycle one warp is picked and its instruction is fetched and decoded. The decoded instructions then enter another, smaller queue. Each cycle, yet another scheduler checks that queue for dependencies among the warp instructions and tries to find one to issue into the execution pipeline. In the end, once a warp has finished execution, its registers and branch stack are released; this allows another waiting warp to take its place and start executing.
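The two schedulers inside an SM can be sketched as a toy host-side model (purely conceptual, not any vendor's real design; all structure names are made up): each cycle one warp from the active set gets an instruction fetched and decoded, and a separate issue stage scans a small queue for a dependency-free instruction to send to the execution pipeline.

#include <deque>
#include <vector>

struct WarpInstr {
    int  warp_id;
    bool ready;                          // all source operands available
};

struct SM {
    std::vector<int>      active_warps;  // ids of warps resident in this SM
    std::deque<WarpInstr> issue_queue;   // small post-decode buffer
    size_t                rr = 0;        // round-robin fetch pointer

    // Stand-in for a real scoreboard that tracks register dependencies.
    bool operands_ready(int /*warp_id*/) const { return true; }

    void cycle() {
        // 1) Warp scheduler: pick one warp from the active set, fetch + decode.
        if (!active_warps.empty()) {
            int w = active_warps[rr++ % active_warps.size()];
            issue_queue.push_back({w, operands_ready(w)});
        }
        // 2) Issue scheduler: find an instruction whose dependencies are met
        //    and send it to the execution pipeline.
        for (auto it = issue_queue.begin(); it != issue_queue.end(); ++it) {
            if (it->ready) { issue_queue.erase(it); break; }
        }
        // 3) When a warp retires, its registers and convergence stack are
        //    freed, letting the core scheduler bring in a waiting warp.
    }
};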
Preemption
Now we allow warps from different kernels (or even different processes) to execute concurrently in a GPU core, but we still lack a time-sharing feature. For example, imagine you have a long-running background computation task but you still want to refresh the display every 1/60 of a second. You can do this at the thread-block level: move the thread blocks of the render kernel to the front of the core scheduler's waitlist, and the next time the preceding thread blocks finish executing, dispatch warps from the render thread blocks rather than from the compute ones. In case the compute thread blocks take too long to execute, you can preempt at the instruction level instead. This is costly: it requires saving all the current register context and other state, then swapping in the thread blocks of the render kernel. Once the rendering threads have finished, you can restore the register state and resume execution of the previous thread as usual.
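On NVIDIA GPUs, the closest user-visible knob for the thread-block-level policy described above is stream priorities: thread blocks from a high-priority stream are taken from the waitlist ahead of blocks from a low-priority stream, with no guarantee of instruction-level preemption. A minimal sketch (kernel names are placeholders):

#include <cuda_runtime.h>

__global__ void long_compute_kernel() { /* long-running background work */ }
__global__ void render_kernel()       { /* must finish within the frame  */ }

int main() {
    int lowest, highest;
    // Query the valid priority range (numerically smaller = higher priority).
    cudaDeviceGetStreamPriorityRange(&lowest, &highest);

    cudaStream_t compute, render;
    cudaStreamCreateWithPriority(&compute, cudaStreamNonBlocking, lowest);
    cudaStreamCreateWithPriority(&render,  cudaStreamNonBlocking, highest);

    long_compute_kernel<<<1024, 256, 0, compute>>>();   // keeps the GPU busy
    render_kernel<<<64, 256, 0, render>>>();            // its blocks jump the waitlist

    cudaStreamSynchronize(render);
    cudaStreamSynchronize(compute);
    return 0;
}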
Apple gives 64 KB of scratchpad memory to each GPU core, and a thread block can use up to 32 KB of it. I think this design is intended so that at any cycle the core can keep at least two thread blocks active (if other resources are not constrained). So when execution reaches the end of one thread block, the core scheduler can start dispatching threads from the next thread block if there are free slots in the SMs.
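In CUDA terms the same constraint shows up as static shared-memory usage (the 64 KB per core / 32 KB per block figures above are the post's Apple numbers; NVIDIA's limits differ): a block that declares 32 KB can be co-resident with at most one other such block on a 64 KB core.

// Hypothetical kernel that claims 32 KB of scratchpad (shared memory).
// With a 64 KB per-core pool, at most two such blocks are resident at once,
// so when one block finishes there is always room to start the next.
__global__ void tiled_kernel(float *out, const float *in) {
    __shared__ float tile[8192];          // 8192 floats = 32 KB per block
    int i = threadIdx.x;
    tile[i] = in[i];
    __syncthreads();
    out[i] = tile[i];
}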
So basically things happen like this: there is a waitlist inside each GPU core, and the top-level scheduler dispatches thread blocks to each GPU core. Every cycle, the core scheduler checks whether it can dispatch a thread block from the waitlist (i.e. it checks scratchpad usage and free slots in the SMs). If all the required resources can be reserved, it dispatches the warps into the SMs; otherwise it continues waiting for free slots.
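The admission check in this last paragraph can be sketched as a small host-side model (purely conceptual; the slot counts are made-up examples): a block leaves the waitlist only if both its scratchpad request and its warp slots can be reserved.

#include <deque>

struct ThreadBlock {
    int warps;              // number of warps in this block
    int scratchpad_bytes;   // local shared memory it needs
};

struct GPUCore {
    int free_warp_slots = 4 * 48;        // e.g. 4 SMs x 48 warp slots each (illustrative)
    int free_scratchpad = 64 * 1024;     // 64 KB per core (the Apple figure above)
    std::deque<ThreadBlock> waitlist;    // blocks handed down by the top-level scheduler

    // Called every cycle by the core scheduler.
    void try_dispatch() {
        if (waitlist.empty()) return;
        const ThreadBlock &tb = waitlist.front();
        if (tb.scratchpad_bytes <= free_scratchpad && tb.warps <= free_warp_slots) {
            free_scratchpad -= tb.scratchpad_bytes;   // reserve resources
            free_warp_slots -= tb.warps;
            // ...hand the block's warps to the SM warp schedulers here...
            waitlist.pop_front();
        }
        // Otherwise keep waiting: resources come back when a resident block
        // finishes and its registers / scratchpad are released.
    }
};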

