Basic texture support
244
draw/README.md
@@ -47,99 +47,107 @@ primitives and effects can be added to the library without architectural changes

 ### Overview: three pipelines

-The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
-**render-state requirements**:
+The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
+**render-pass structure** (everything vs backdrop):

-1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
-registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
-frame.
+1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Low register
+footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture.
+Handles 90%+ of all fragments in a typical frame.

 2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
 effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
 shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
-redundant overdraw.
+redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
+low-end hardware (see register analysis below).

-3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
-render target as input. High register footprint (~70–80 registers) and structurally requires a
-`CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
-pressure and because the texture-copy requirement forces a render-pass-level state change.
+3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
+target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
+where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated
+from the other pipelines because it structurally requires ending the current render pass and
+copying the render target before any backdrop-sampling fragment can execute — a command-buffer-
+level boundary that cannot be avoided regardless of shader complexity.

 A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
 uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
-switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
-is under 0.15% of an 8.3ms (120 FPS) frame budget.
+switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
+overhead is negligible relative to an 8.3ms (120 FPS) frame budget.
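
The bind-overhead estimate above can be checked with a few lines of arithmetic. This is an illustrative Python sketch, not library code: `switch_overhead_fraction` is a hypothetical helper, and it assumes the worst case is three binds per frame at the quoted 1–5μs each.

```python
def switch_overhead_fraction(binds: int, bind_cost_us: float,
                             frame_budget_ms: float = 8.3) -> float:
    """Fraction of the frame budget spent on pipeline binds."""
    return (binds * bind_cost_us) / (frame_budget_ms * 1000.0)


# Assumed worst case: all three pipelines are bound once in the frame.
for cost_us in (1.0, 5.0):
    frac = switch_overhead_fraction(binds=3, bind_cost_us=cost_us)
    print(f"{cost_us:.0f} us per bind -> {frac * 100:.3f}% of the frame")
```

Even at the high end of the range the result stays well under 1% of the frame, which is why the text treats switching cost as negligible.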

 ### Why three pipelines, not one or seven

 The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
 code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).

-The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
-shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
-compiler allocates registers pessimistically based on the worst-case path through the shader. If the
-shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
-— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
-warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.
+#### Main/effects split: register pressure

-Concrete occupancy analysis on modern NVIDIA SMs, which have 65,536 32-bit registers and a
-hardware-imposed maximum thread count per SM that varies by architecture (Volta/A100: 2,048;
-consumer Ampere/Ada: 1,536). Occupancy is register-limited only when `65536 / regs_per_thread` falls
-below the hardware thread cap; above that cap, occupancy is 100% regardless of register count.
+A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
+allocates registers pessimistically based on the worst-case path through the shader. If the shader
+contains both a 20-register RRect SDF and a 48-register drop-shadow blur, _every_ fragment — even
+trivial RRects — is allocated 48 registers. This directly reduces **occupancy** (the number of
+warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
+latency.

-On consumer Ampere/Ada GPUs (RTX 30xx/40xx, max 1,536 threads per SM):
+Each GPU architecture has a **register cliff** — a threshold above which occupancy starts dropping.
+Below the cliff, adding registers has zero occupancy cost.

-| Register allocation       | Reg-limited threads | Actual (hw-capped) | Occupancy |
-| ------------------------- | ------------------- | ------------------ | --------- |
-| 20 regs (RRect only)      | 3,276               | 1,536              | 100%      |
-| 32 regs                   | 2,048               | 1,536              | 100%      |
-| 48 regs (+ drop shadow)   | 1,365               | 1,365              | ~89%      |
-| 72 regs (+ frosted glass) | 910                 | 910                | ~59%      |
+On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):

-On Volta/A100 GPUs (max 2,048 threads per SM):
+| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
+| ----------------------- | ------------------- | ------------------ | --------- |
+| 20 regs (main pipeline) | 3,276               | 1,536              | 100%      |
+| 32 regs                 | 2,048               | 1,536              | 100%      |
+| 48 regs (effects)       | 1,365               | 1,365              | ~89%      |

-| Register allocation       | Reg-limited threads | Actual (hw-capped) | Occupancy |
-| ------------------------- | ------------------- | ------------------ | --------- |
-| 20 regs (RRect only)      | 3,276               | 2,048              | 100%      |
-| 32 regs                   | 2,048               | 2,048              | 100%      |
-| 48 regs (+ drop shadow)   | 1,365               | 1,365              | ~67%      |
-| 72 regs (+ frosted glass) | 910                 | 910                | ~44%      |
+On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):

-The register cliff — where occupancy begins dropping — starts at ~43 regs/thread on consumer
-Ampere/Ada (65536 / 1536) and ~32 regs/thread on Volta/A100 (65536 / 2048). Below the cliff,
-adding registers has zero occupancy cost.
+| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
+| ----------------------- | ------------------- | ------------------ | --------- |
+| 20 regs (main pipeline) | 3,276               | 2,048              | 100%      |
+| 32 regs                 | 2,048               | 2,048              | 100%      |
+| 48 regs (effects)       | 1,365               | 1,365              | ~67%      |
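
The occupancy figures in these tables follow from a simple model: threads per SM are limited either by the register file or by the hardware thread cap, whichever is smaller. The Python sketch below reproduces that model for illustration. The function names are hypothetical, and it deliberately ignores the warp-granularity and allocation-unit rounding that real hardware applies, so treat the outputs as the same approximations the tables use.

```python
def occupancy(regs_per_thread: int, max_threads: int,
              regs_per_sm: int = 65536) -> float:
    """Register-limited occupancy under the simplified model in the tables."""
    reg_limited_threads = regs_per_sm // regs_per_thread
    return min(reg_limited_threads, max_threads) / max_threads


def register_cliff(max_threads: int, regs_per_sm: int = 65536) -> float:
    """Registers per thread above which occupancy starts to drop."""
    return regs_per_sm / max_threads


# A unified shader pays for its most expensive branch on every fragment.
unified_regs = max(20, 48)  # RRect SDF vs drop-shadow blur

for name, cap in (("consumer Ampere/Ada", 1536), ("Volta/A100", 2048)):
    print(f"{name}: cliff at ~{register_cliff(cap):.1f} regs")
    for regs in (20, 32, unified_regs):
        print(f"  {regs} regs -> {occupancy(regs, cap):.0%}")
```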

-The impact of reduced occupancy depends on whether the shader is memory-latency-bound (where
-occupancy is critical for hiding latency) or ALU-bound (where it matters less). For the
-backdrop-effects pipeline's frosted-glass shader, which performs multiple dependent texture reads,
-59% occupancy (consumer) or 44% occupancy (Volta) meaningfully reduces the GPU's ability to hide
-texture latency — roughly a 1.7× to 2.3× throughput reduction compared to full occupancy. At 4K with
-1.5× overdraw (~12.4M fragments), if the main pipeline's fragment work at full occupancy takes ~2ms,
-a single unified shader containing the glass branch would push it to ~3.4–4.6ms depending on
-architecture. This is a per-frame multiplier, not a per-primitive cost — it applies even when the
-heavy branch is never taken, because the compiler allocates registers for the worst-case path.
+On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):

-**Note on Apple M3+ GPUs:** Apple's M3 GPU architecture introduces Dynamic Caching (register file
-virtualization), which allocates registers dynamically at runtime based on actual usage rather than
-worst-case declared usage. This significantly reduces the static register-pressure-to-occupancy
-penalty described above. The tier split remains useful on Apple hardware for other reasons (keeping
-the backdrop texture-copy out of the main render pass, isolating blur ALU complexity), but the
-register-pressure argument specifically weakens on M3 and later.
+| Register allocation  | Occupancy                  |
+| -------------------- | -------------------------- |
+| 0–32 regs (main)     | 100% (full thread count)   |
+| 33–64 regs (effects) | ~50% (thread count halves) |

-The three-pipeline split groups primitives by register footprint so that:
+Mali's cliff at 32 registers is the binding constraint. On desktop the occupancy difference between
+20 and 48 registers is modest (89–100%); on Mali it is a hard 2× throughput reduction. The
+main/effects split protects 90%+ of a frame's fragments (shapes, text, textures) from the effects
+pipeline's register cost.

-- Main pipeline (~20 regs): all fragments run at full occupancy on every architecture.
-- Effects pipeline (~48–55 regs): shadow/glow fragments run at 67–89% occupancy depending on
-architecture; unavoidable given the blur math complexity.
-- Backdrop-effects pipeline (~72–75 regs): glass fragments run at 44–59% occupancy; also
-unavoidable, and structurally separated anyway by the texture-copy requirement.
+For the effects pipeline's drop-shadow shader — erf-approximation blur math with several texture
+fetches — 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
+fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
+low-end mobile. This is a per-frame multiplier even when the heavy branch is never taken, because the
+compiler allocates registers for the worst-case path.
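
The Mali numbers above can be reproduced with a step function and one division. This Python sketch is illustrative only: the helper names are hypothetical, the two-step occupancy model is the one stated in the Mali table, and the cost estimate assumes throughput scales linearly with occupancy, which is the rough model the paragraph uses rather than a measured result.

```python
def mali_occupancy(regs_per_thread: int) -> float:
    """Two-step occupancy model from the Mali table (cliff at 32 registers)."""
    if not 0 <= regs_per_thread <= 64:
        raise ValueError("model only covers 0-64 registers per thread")
    return 1.0 if regs_per_thread <= 32 else 0.5


def fragments_4k(overdraw: float = 1.5) -> float:
    """Fragment count for a 3840x2160 target at the given overdraw factor."""
    return 3840 * 2160 * overdraw


def scaled_cost_ms(full_occupancy_ms: float, occupancy: float) -> float:
    """Frame cost assuming throughput is proportional to occupancy."""
    return full_occupancy_ms / occupancy


print(f"{fragments_4k() / 1e6:.1f}M fragments")
print(f"split main pipeline (20 regs): {scaled_cost_ms(2.0, mali_occupancy(20))} ms")
print(f"unified with shadows (48 regs): {scaled_cost_ms(2.0, mali_occupancy(48))} ms")
```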

-This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
-(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
-actually need them. Crucially, all shape kinds within the main pipeline (SDF, tessellated, text)
-cluster at 12–24 registers — well below the register cliff on every architecture — so unifying them
-costs nothing in occupancy.
+All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
+12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
+occupancy.

-**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
+**Note on Apple M3+ GPUs:** Apple's M3 introduces Dynamic Caching (register file virtualization),
+which allocates registers at runtime based on actual usage rather than worst-case. This weakens the
+static register-pressure argument on M3 and later, but the split remains useful for isolating blur
+ALU complexity and keeping the backdrop texture-copy out of the main render pass.
+
+#### Backdrop split: render-pass structure
+
+The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
+reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the current
+render target must be copied to a separate texture via `CopyGPUTextureToTexture` — a command-buffer-
+level operation that requires ending the current render pass. This boundary exists regardless of
+shader complexity and cannot be optimized away.
+
+The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
+register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy
+problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
+inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
+
+#### Why not per-primitive-type pipelines (GPUI's approach)
+
+Zed's GPUI uses 7 separate shader pairs:
 quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
 branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
 for our use case:
@@ -151,7 +159,7 @@ typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kin
 ~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
 pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
 375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
-compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
+compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
 negligible (all members cluster at 12–22 registers).

 **Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
@@ -190,8 +198,8 @@ in submission order:
 ~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
 cheaper than the pipeline-switching alternative.

-The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
-boundaries where occupancy drops are significant at 4K (see numbers above). Within a tier, unified is
+The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
+and structural render-pass requirements (see analysis above). Within a pipeline, unified is
 strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
 negligible GPU-side branching cost.
@@ -483,25 +491,40 @@ Wallace's variant) and vger-rs.
 - Vello's implementation of blurred rounded rectangle as a gradient type:
 https://github.com/linebender/vello/pull/665

-### Backdrop-effects pipeline
+### Backdrop pipeline

-The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
-glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
-reasons:
+The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
+refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
+register pressure.

-1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
-target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-
-buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a pipeline
-boundary.
+**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
+must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
+operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
+amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
+while also writing to it.

-2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
-(multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
-effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
-measurable throughput on the common case.
+**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
+(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
+iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
+pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
+occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass
+Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including
+Mali-G31 and VideoCore VI) where per-thread register limits are tight.

-The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
-no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
-happens — zero cost.
+**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
+buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
+resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
+regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
+is never entered and the texture copy never happens — zero cost.
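
The bracketed-execution rule can be sketched as a small command planner. This is illustrative Python, not the library's command encoding: `plan_frame` and every command name in it are hypothetical, and the sketch assumes each backdrop draw runs the full four-step sub-pass sequence. The property it demonstrates is the one stated above: one render-pass break and one copy per frame no matter how many backdrop draws exist, and neither when there are none.

```python
BACKDROP_SUB_PASSES = ("downsample", "blur_horizontal", "blur_vertical", "composite")


def plan_frame(main_draws: int, backdrop_draws: int) -> list:
    """Return the ordered command names for one frame (hypothetical names)."""
    commands = ["begin_render_pass", "bind_main_pipeline"]
    commands += ["draw_main"] * main_draws

    if backdrop_draws > 0:
        # Enter the bracket once: break the pass and capture the render target.
        commands += ["end_render_pass", "copy_render_target"]
        for _ in range(backdrop_draws):
            commands += [f"backdrop_{step}" for step in BACKDROP_SUB_PASSES]
        # Leave the bracket: resume normal drawing in a fresh render pass.
        commands.append("begin_render_pass")

    commands.append("end_render_pass")
    return commands


print(plan_frame(main_draws=2, backdrop_draws=0))
print(plan_frame(main_draws=2, backdrop_draws=3).count("copy_render_target"))
```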
+
+**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from
+~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument
+that justifies the main/effects split does not apply here. The main/effects split protects the
+_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop
+pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass
+runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all.
+Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical
+UI scales), so the occupancy variation within the bracket has negligible impact on frame time.

 ### Vertex layout
@@ -524,19 +547,21 @@ The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex

 ```
 Primitive :: struct {
-    kind: Shape_Kind, // 0: enum u8
-    flags: Shape_Flags, // 1: bit_set[Shape_Flag; u8]
-    _pad: u16, // 2: reserved
-    bounds: [4]f32, // 4: min_x, min_y, max_x, max_y
-    color: Color, // 20: u8x4
-    _pad2: [3]u8, // 24: alignment
-    params: Shape_Params, // 28: raw union, 32 bytes
+    bounds: [4]f32, // 0: min_x, min_y, max_x, max_y
+    color: Color, // 16: u8x4, unpacked in shader via unpackUnorm4x8
+    kind_flags: u32, // 20: (kind as u32) | (flags as u32 << 8)
+    rotation: f32, // 24: shader self-rotation in radians
+    _pad: f32, // 28: alignment
+    params: Shape_Params, // 32: raw union, 32 bytes (two vec4s of shape-specific data)
+    uv_rect: [4]f32, // 64: texture UV sub-region (u_min, v_min, u_max, v_max)
 }
-// Total: 60 bytes (padded to 64 for GPU alignment)
+// Total: 80 bytes (std430 aligned)
 ```

 `Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
-etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
+etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side. The
+`uv_rect` field is used by textured SDF primitives (Shape_Flag.Textured); non-textured primitives
+leave it zeroed.
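
The byte offsets annotated on the new `Primitive` layout can be sanity-checked outside the renderer. The sketch below mirrors the struct with Python's `ctypes` purely as an illustration: it is not the library's Odin definition, `params` is modelled as eight floats as a stand-in for the 32-byte `Shape_Params` union, and `pack_kind_flags` / `unpack_kind_flags` are hypothetical helpers that follow the `(kind as u32) | (flags as u32 << 8)` comment.

```python
import ctypes


class Primitive(ctypes.Structure):
    """Mirror of the 80-byte layout described above (illustrative only)."""
    _fields_ = [
        ("bounds", ctypes.c_float * 4),   # offset 0
        ("color", ctypes.c_uint8 * 4),    # offset 16
        ("kind_flags", ctypes.c_uint32),  # offset 20
        ("rotation", ctypes.c_float),     # offset 24
        ("_pad", ctypes.c_float),         # offset 28
        ("params", ctypes.c_float * 8),   # offset 32, stand-in for the raw union
        ("uv_rect", ctypes.c_float * 4),  # offset 64
    ]


def pack_kind_flags(kind: int, flags: int) -> int:
    """Pack an 8-bit kind and 8-bit flag set into one u32."""
    return (kind & 0xFF) | ((flags & 0xFF) << 8)


def unpack_kind_flags(kind_flags: int) -> tuple:
    """Inverse of pack_kind_flags: returns (kind, flags)."""
    return kind_flags & 0xFF, (kind_flags >> 8) & 0xFF


print(ctypes.sizeof(Primitive))
for name, _ in Primitive._fields_:
    print(name, getattr(Primitive, name).offset)
```

Packing `kind` and `flags` into a single `u32` keeps every field at 4-byte granularity, which is what lets the fields sit at the listed offsets with no implicit padding.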

 ### Draw submission order
@@ -547,7 +572,7 @@ Within each scissor region, draws are issued in submission order to preserve the
 2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
 shapes, indexed for text). Pipeline state unchanged from today.
 3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
-4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
+4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
 backdrop primitives.

 The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
@@ -647,7 +672,7 @@ register-pressure analysis from the pipeline-strategy section above shows this i
 so both run at 100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF
 evaluation) amounts to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline
 would add ~1–5μs per pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side
-savings. Within the main tier, unified remains strictly better.
+savings. Within the main pipeline, unified remains strictly better.

 The naming convention follows the existing shape API: `rectangle_texture` and
 `rectangle_texture_corners` sit alongside `rectangle` and `rectangle_corners`, mirroring the
@@ -725,6 +750,35 @@ The following are plumbed in the descriptor but not implemented in phase 1:
 expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
 sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

+## Multi-window support
+
+The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
+deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
+(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.
+
+RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
+that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
+it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
+the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
+per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
+(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
+bucket, so different windows can have independent clip and transform state without interference.
+
+The key properties this gives RAD:
+
+- **Per-window isolation.** Each window builds its own bucket with its own pass list and state stacks.
+No global contention.
+- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
+build buckets concurrently, then submit them to the render backend sequentially.
+- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
+bucket with a transform applied, without rebuilding its draw calls.
+
+For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
+context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
+style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
+global (shared across windows), with only the per-frame staging buffers and layer/scissor state
+becoming per-context.
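
One possible shape for that adaptation is sketched below in Python, purely to illustrate the ownership split. None of these names exist in the library: `SharedResources`, `DrawContext`, and the `begin` / `rectangle` / `end` signatures are hypothetical. The only points taken from the text are that textures and samplers stay shared, staging and scissor state become per-context, and draw calls keep explicit parameters.

```python
from dataclasses import dataclass, field


@dataclass
class SharedResources:
    """Resources that stay global across windows."""
    textures: dict = field(default_factory=dict)
    samplers: dict = field(default_factory=dict)


@dataclass
class DrawContext:
    """Per-window state that would replace the single GLOB."""
    window_id: int
    shared: SharedResources
    staging: list = field(default_factory=list)        # per-frame draw records
    scissor_stack: list = field(default_factory=list)  # per-context clip state
    in_frame: bool = False


def begin(ctx: DrawContext) -> None:
    if ctx.in_frame:
        raise RuntimeError("begin called twice without end")
    ctx.staging.clear()
    ctx.scissor_stack.clear()
    ctx.in_frame = True


def rectangle(ctx: DrawContext, x: float, y: float, w: float, h: float) -> None:
    """Explicit-parameter draw call: the target context is always passed in."""
    if not ctx.in_frame:
        raise RuntimeError("draw call outside begin/end")
    ctx.staging.append(("rectangle", x, y, w, h))


def end(ctx: DrawContext) -> list:
    """Finish the frame and return the records to submit for this window."""
    ctx.in_frame = False
    return list(ctx.staging)


shared = SharedResources()
main_window = DrawContext(window_id=1, shared=shared)
tool_window = DrawContext(window_id=2, shared=shared)
begin(main_window)
begin(tool_window)
rectangle(main_window, 0, 0, 100, 50)
print(len(end(main_window)), len(end(tool_window)))
```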
+
 ## Building shaders

 GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)