Notes about CUDA Green Context

This post collects notes on CUDA Green Context.

Introduction

Green contexts are a lightweight alternative to traditional contexts that can be used to select a subset of device resources. This allows the developer to, for example, select SMs from distinct spatial partitions of the GPU and target them via CUDA stream operations, kernel launches, etc.

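In CUDA, a Green Context is a lightweight form of context that can serve as an alternative to a traditional CUDA context, giving developers finer-grained control over spatial partitioning of the GPU and over resource allocation.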

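A Green Context lets you define and manage independent partitions of GPU resources, primarily Streaming Multiprocessors (SMs). You can assign a specific number of SMs to a particular Green Context, then launch CUDA kernels and manage streams that run only within the resources owned by that context.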

Technical positioning

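Green Contexts sit between CUDA Streams (dynamic, but with no partitioning) and MPS (partitioned, but not dynamic enough). They support dynamic partitioning of SM resources within a single process and provide deterministic, asymmetric execution.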

Green Context enables lightweight intra-process spatial sharing. Briefly, we can create multiple CUDA streams with dedicated SM allocations for concurrent GPU kernels.

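Green Context addresses a hard problem in GPU resource scheduling: it upgrades "hit-or-miss" parallelism to "precisely planned" parallelism. By pre-allocating pools of SMs, it lets critical tasks hold dedicated compute, which keeps response latency low and keeps concurrent tasks from interfering with one another on those SMs.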

Use cases

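A typical use case: part of your program is extremely latency-sensitive and must run ahead of all other GPU work. If you create a dedicated Green Context for that code and assign SMs to it, and assign the remaining SMs to a second Green Context that handles everything else, you ensure that SMs are always available for the high-priority computation.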
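The two-way split described above comes down to simple arithmetic, which can be sketched on the host before any CUDA call is made. This is a minimal sketch under stated assumptions: `SmSplit`, `plan_sm_split`, and the `granularity` parameter are hypothetical names introduced here for illustration, not part of the CUDA API. The real alignment rules depend on the GPU architecture, and the driver has the final say on how many SMs each partition actually gets.

```cpp
#include <stdexcept>

// Hypothetical planning result: how many SMs each partition would get.
struct SmSplit {
    unsigned critical;  // SMs reserved for the latency-sensitive work
    unsigned rest;      // SMs left for all other work
};

// Round `requested` up to a multiple of `granularity` (an assumed alignment,
// not a value taken from the CUDA API) and give the remainder to the second
// partition. Throws if the request leaves nothing for the second partition.
inline SmSplit plan_sm_split(unsigned total, unsigned requested,
                             unsigned granularity) {
    if (granularity == 0 || requested == 0) {
        throw std::invalid_argument("requested and granularity must be > 0");
    }
    unsigned critical =
        ((requested + granularity - 1) / granularity) * granularity;
    if (critical >= total) {
        throw std::invalid_argument("no SMs left for the second partition");
    }
    return SmSplit{critical, total - critical};
}
```

For example, with 160 SMs, a request for 30 SMs, and an assumed granularity of 8, the plan reserves 32 SMs for the critical partition and leaves 128 for everything else.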

The ASPLOS '26 paper "Towards High-Goodput LLM Serving with Prefill-decode Multiplexing" builds on Green Context; see also the blog post "PD-Multiplexing: Unlocking High-Goodput LLM Serving with GreenContext".

Creation and usage workflow

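1. Query the GPU's SM count (160 in the Blackwell example used here; the actual number depends on the GPU model)
2. Decide on a resource partitioning strategy
3. Generate a resource descriptor for the partition
4. Create the Green Context (the sandbox) from the descriptor
5. Create a CUDA stream from the context
6. Submit kernels to the stream as usual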
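The steps above can be sketched with the CUDA driver API. This is a minimal sketch, not a definitive implementation. It assumes CUDA 12.4 or later, device 0, and a request for at least 16 SMs in the latency-critical partition. The `touch` kernel, the `CHECK_CU` macro, and all variable names are illustrative and do not come from the original notes. It also assumes that a runtime-style kernel launch into a green-context stream works while the device's primary context is current.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a message if a driver API call fails.
#define CHECK_CU(call)                                              \
    do {                                                            \
        CUresult rc_ = (call);                                      \
        if (rc_ != CUDA_SUCCESS) {                                  \
            const char* msg_ = nullptr;                             \
            cuGetErrorString(rc_, &msg_);                           \
            std::fprintf(stderr, "%s failed: %s\n", #call,          \
                         msg_ ? msg_ : "unknown error");            \
            std::exit(EXIT_FAILURE);                                \
        }                                                           \
    } while (0)

// Placeholder kernel standing in for real latency-sensitive work.
__global__ void touch(int* out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    CHECK_CU(cuInit(0));
    CUdevice dev;
    CHECK_CU(cuDeviceGet(&dev, 0));
    // Assumption: keep the primary context current for the kernel launch.
    CUcontext primary;
    CHECK_CU(cuDevicePrimaryCtxRetain(&primary, dev));
    CHECK_CU(cuCtxSetCurrent(primary));

    // Step 1: query the SM resource of the whole device.
    CUdevResource all_sms;
    CHECK_CU(cuDeviceGetDevResource(dev, &all_sms, CU_DEV_RESOURCE_TYPE_SM));
    std::printf("device SMs: %u\n", all_sms.sm.smCount);

    // Step 2: carve out one group with at least 16 SMs (assumed figure);
    // whatever is left over is returned in `remaining`.
    CUdevResource critical, remaining;
    unsigned int nb_groups = 1;
    CHECK_CU(cuDevSmResourceSplitByCount(&critical, &nb_groups, &all_sms,
                                         &remaining, 0, 16));
    std::printf("critical: %u SMs, remaining: %u SMs\n",
                critical.sm.smCount, remaining.sm.smCount);

    // Step 3: generate a descriptor for the critical partition.
    CUdevResourceDesc desc;
    CHECK_CU(cuDevResourceGenerateDesc(&desc, &critical, 1));

    // Step 4: create the green context from the descriptor.
    CUgreenCtx gctx;
    CHECK_CU(cuGreenCtxCreate(&gctx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM));

    // Step 5: create a stream that belongs to the green context.
    CUstream stream;
    CHECK_CU(cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0));

    // Step 6: submit work to the stream as usual.
    CUdeviceptr buf;
    CHECK_CU(cuMemAlloc(&buf, 32 * sizeof(int)));
    touch<<<1, 32, 0, stream>>>(reinterpret_cast<int*>(buf));
    CHECK_CU(cuStreamSynchronize(stream));

    // Tear down in reverse order of creation.
    CHECK_CU(cuMemFree(buf));
    CHECK_CU(cuStreamDestroy(stream));
    CHECK_CU(cuGreenCtxDestroy(gctx));
    CHECK_CU(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

A build line such as `nvcc green_ctx.cu -lcuda` should suffice (the file name is arbitrary). The `remaining` resource can be turned into a second descriptor and a second green context in the same way, which matches the two-partition use case described earlier. The SM counts the driver reports for each partition may be larger than requested because of architecture-specific alignment.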

Summary

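The core value of Green Context is not simply "limiting how many SMs a kernel can use", but giving CUDA a more controllable way to manage resources. It turns what used to be opportunistic parallelism into parallelism that can be planned. By partitioning SM resources per context, an application can make more flexible trade-offs among latency, throughput, and isolation.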

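For scenarios that need to guarantee response time for critical tasks, reduce interference between streams, or evaluate different resource ratios, Green Context offers a new approach that requires almost no kernel changes and only a small amount of host-side work.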
