Basic texture support

Zachary Levy
2026-04-21 13:01:02 -07:00
parent f85187eff3
commit a4623a13b5
17 changed files with 1375 additions and 216 deletions

@@ -47,99 +47,107 @@ primitives and effects can be added to the library without architectural changes
### Overview: three pipelines
The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
**render-state requirements**:
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
frame.
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Low register
footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture.
Handles 90%+ of all fragments in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
redundant overdraw.
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
low-end hardware (see register analysis below).
3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
render target as input. High register footprint (~70–80 registers) and structurally requires a
`CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
pressure and because the texture-copy requirement forces a render-pass-level state change.
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated
from the other pipelines because it structurally requires ending the current render pass and
copying the render target before any backdrop-sampling fragment can execute — a command-buffer-
level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
is under 0.15% of an 8.3ms (120 FPS) frame budget.
switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
overhead is negligible relative to an 8.3ms (120 FPS) frame budget.
### Why three pipelines, not one or seven
The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).
The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
compiler allocates registers pessimistically based on the worst-case path through the shader. If the
shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.
#### Main/effects split: register pressure
Concrete occupancy analysis on modern NVIDIA SMs, which have 65,536 32-bit registers and a
hardware-imposed maximum thread count per SM that varies by architecture (Volta/A100: 2,048;
consumer Ampere/Ada: 1,536). Occupancy is register-limited only when `65536 / regs_per_thread` falls
below the hardware thread cap; above that cap, occupancy is 100% regardless of register count.
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 20-register RRect SDF and a 48-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 48 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
On consumer Ampere/Ada GPUs (RTX 30xx/40xx, max 1,536 threads per SM):
Each GPU architecture has a **register cliff** — a threshold above which occupancy starts dropping.
Below the cliff, adding registers has zero occupancy cost.
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ------------------------- | ------------------- | ------------------ | --------- |
| 20 regs (RRect only) | 3,276 | 1,536 | 100% |
| 32 regs | 2,048 | 1,536 | 100% |
| 48 regs (+ drop shadow) | 1,365 | 1,365 | ~89% |
| 72 regs (+ frosted glass) | 910 | 910 | ~59% |
On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):
On Volta/A100 GPUs (max 2,048 threads per SM):
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ----------------------- | ------------------- | ------------------ | --------- |
| 20 regs (main pipeline) | 3,276 | 1,536 | 100% |
| 32 regs | 2,048 | 1,536 | 100% |
| 48 regs (effects) | 1,365 | 1,365 | ~89% |
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ------------------------- | ------------------- | ------------------ | --------- |
| 20 regs (RRect only) | 3,276 | 2,048 | 100% |
| 32 regs | 2,048 | 2,048 | 100% |
| 48 regs (+ drop shadow) | 1,365 | 1,365 | ~67% |
| 72 regs (+ frosted glass) | 910 | 910 | ~44% |
On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):
The register cliff — where occupancy begins dropping — starts at ~43 regs/thread on consumer
Ampere/Ada (65536 / 1536) and ~32 regs/thread on Volta/A100 (65536 / 2048). Below the cliff,
adding registers has zero occupancy cost.
| Register allocation | Reg-limited threads | Actual (hw-capped) | Occupancy |
| ----------------------- | ------------------- | ------------------ | --------- |
| 20 regs (main pipeline) | 3,276 | 2,048 | 100% |
| 32 regs | 2,048 | 2,048 | 100% |
| 48 regs (effects) | 1,365 | 1,365 | ~67% |
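
The table rows above follow from one small calculation. As a minimal sketch (the function names are hypothetical, and the model assumes occupancy is limited only by the register file and the hardware thread cap, ignoring allocation granularity and other limits such as shared memory):

```python
def occupancy(regs_per_thread, regs_per_sm=65536, max_threads=1536):
    """Return (reg-limited threads, hw-capped threads, occupancy fraction).

    Simplified model: the only limits are the SM register file and the
    hardware thread cap. Real hardware allocates registers in granules.
    """
    reg_limited = regs_per_sm // regs_per_thread
    actual = min(reg_limited, max_threads)
    return reg_limited, actual, actual / max_threads


def register_cliff(regs_per_sm=65536, max_threads=1536):
    """Registers per thread above which occupancy starts to drop."""
    return regs_per_sm / max_threads


if __name__ == "__main__":
    for cap, name in ((1536, "consumer Ampere/Ada"), (2048, "Volta/A100")):
        print(f"{name}: cliff at ~{register_cliff(max_threads=cap):.1f} regs")
        for regs in (20, 32, 48, 72):
            limited, actual, occ = occupancy(regs, max_threads=cap)
            print(f"  {regs} regs: {limited} -> {actual} threads, {occ:.0%}")
```

Under these assumptions the sketch gives the same figures as the tables: the thread cap binds at 20 and 32 registers, and the register file binds at 48 and 72.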
The impact of reduced occupancy depends on whether the shader is memory-latency-bound (where
occupancy is critical for hiding latency) or ALU-bound (where it matters less). For the
backdrop-effects pipeline's frosted-glass shader, which performs multiple dependent texture reads,
59% occupancy (consumer) or 44% occupancy (Volta) meaningfully reduces the GPU's ability to hide
texture latency — roughly a 1.7× to 2.3× throughput reduction compared to full occupancy. At 4K with
1.5× overdraw (~12.4M fragments), if the main pipeline's fragment work at full occupancy takes ~2ms,
a single unified shader containing the glass branch would push it to ~3.4–4.6ms depending on
architecture. This is a per-frame multiplier, not a per-primitive cost — it applies even when the
heavy branch is never taken, because the compiler allocates registers for the worst-case path.
On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):
**Note on Apple M3+ GPUs:** Apple's M3 GPU architecture introduces Dynamic Caching (register file
virtualization), which allocates registers dynamically at runtime based on actual usage rather than
worst-case declared usage. This significantly reduces the static register-pressure-to-occupancy
penalty described above. The tier split remains useful on Apple hardware for other reasons (keeping
the backdrop texture-copy out of the main render pass, isolating blur ALU complexity), but the
register-pressure argument specifically weakens on M3 and later.
| Register allocation | Occupancy |
| -------------------- | -------------------------- |
| 0–32 regs (main)     | 100% (full thread count)   |
| 33–64 regs (effects) | ~50% (thread count halves) |
The three-pipeline split groups primitives by register footprint so that:
Mali's cliff at 32 registers is the binding constraint. On desktop the occupancy difference between
20 and 48 registers is modest (89–100%); on Mali it is a hard 2× throughput reduction. The
main/effects split protects 90%+ of a frame's fragments (shapes, text, textures) from the effects
pipeline's register cost.
- Main pipeline (~20 regs): all fragments run at full occupancy on every architecture.
- Effects pipeline (~48–55 regs): shadow/glow fragments run at 67–89% occupancy depending on
architecture; unavoidable given the blur math complexity.
- Backdrop-effects pipeline (~72–75 regs): glass fragments run at 44–59% occupancy; also
unavoidable, and structurally separated anyway by the texture-copy requirement.
For the effects pipeline's drop-shadow shader — erf-approximation blur math with several texture
fetches — 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
low-end mobile. This is a per-frame multiplier even when the heavy branch is never taken, because the
compiler allocates registers for the worst-case path.
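
The Mali estimate can be reproduced with a rough model. This is a sketch under stated assumptions: the step function mirrors the two-row Mali table, and the frame-time estimate assumes a latency-bound shader whose throughput scales linearly with occupancy, which is a simplification. All names are hypothetical.

```python
def mali_occupancy(regs_per_thread):
    """Step model of the Mali register cliff: full thread count up to
    32 registers, half the thread count from 33 to 64."""
    if not 0 <= regs_per_thread <= 64:
        raise ValueError("model covers 0..64 registers per thread")
    return 1.0 if regs_per_thread <= 32 else 0.5


def estimated_fragment_ms(full_occupancy_ms, occupancy):
    """Assumption: fragment time scales inversely with occupancy."""
    return full_occupancy_ms / occupancy


# 4K render target with 1.5x overdraw, as in the text (~12.4M fragments).
FRAGMENTS_4K = int(3840 * 2160 * 1.5)

if __name__ == "__main__":
    print(FRAGMENTS_4K)
    print(estimated_fragment_ms(2.0, mali_occupancy(20)))  # main-only shader
    print(estimated_fragment_ms(2.0, mali_occupancy(48)))  # unified with shadows
```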
This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
actually need them. Crucially, all shape kinds within the main pipeline (SDF, tessellated, text)
cluster at 12–24 registers — well below the register cliff on every architecture — so unifying them
costs nothing in occupancy.
All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
occupancy.
**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
**Note on Apple M3+ GPUs:** Apple's M3 introduces Dynamic Caching (register file virtualization),
which allocates registers at runtime based on actual usage rather than worst-case. This weakens the
static register-pressure argument on M3 and later, but the split remains useful for isolating blur
ALU complexity and keeping the backdrop texture-copy out of the main render pass.
#### Backdrop split: render-pass structure
The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the current
render target must be copied to a separate texture via `CopyGPUTextureToTexture` — a command-buffer-
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy
problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
Zed's GPUI uses 7 separate shader pairs:
quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
for our use case:
@@ -151,7 +159,7 @@ typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kin
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 15
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
negligible (all members cluster at 12–22 registers).
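
The budget percentages quoted above are the CPU overhead divided by the frame time. A minimal sketch of that arithmetic (helper names are hypothetical):

```python
def frame_budget_us(fps):
    """Frame time in microseconds at a given frame rate."""
    return 1_000_000 / fps


def overhead_share(overhead_us, fps=120):
    """Fraction of the frame budget consumed by a fixed CPU overhead."""
    return overhead_us / frame_budget_us(fps)


if __name__ == "__main__":
    for overhead in (375, 500):
        print(f"{overhead}us -> {overhead_share(overhead):.1%} of a 120 FPS frame")
```

The 8.3ms budget in the text is the rounded value of 1000/120 ms.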
**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
@@ -190,8 +198,8 @@ in submission order:
~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
cheaper than the pipeline-switching alternative.
The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
boundaries where occupancy drops are significant at 4K (see numbers above). Within a tier, unified is
The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
and structural render-pass requirements (see analysis above). Within a pipeline, unified is
strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
negligible GPU-side branching cost.
@@ -483,25 +491,40 @@ Wallace's variant) and vger-rs.
- Vello's implementation of blurred rounded rectangle as a gradient type:
https://github.com/linebender/vello/pull/665
### Backdrop-effects pipeline
### Backdrop pipeline
The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
reasons:
The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
register pressure.
1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-
buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a pipeline
boundary.
**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
while also writing to it.
2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
(multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
measurable throughput on the common case.
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass
Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including
Mali-G31 and VideoCore VI) where per-thread register limits are tight.
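
The reason a separable blur is lighter per pass can be shown on the CPU. The sketch below is illustrative only and is not the renderer's shader code: it assumes clamp-to-edge sampling and a normalized 1D Gaussian kernel, and every name in it is hypothetical. It applies the same 1D kernel horizontally and then vertically, and compares the per-pixel tap count with a single-pass 2D kernel.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(image, kernel, horizontal):
    """One blur pass along a single axis with clamp-to-edge sampling."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(kernel):
                d = k - radius
                sx = min(max(x + d, 0), width - 1) if horizontal else x
                sy = y if horizontal else min(max(y + d, 0), height - 1)
                acc += weight * image[sy][sx]
            out[y][x] = acc
    return out


def blur_separable(image, radius, sigma):
    """Horizontal pass followed by vertical pass."""
    kernel = gaussian_kernel(radius, sigma)
    return blur_1d(blur_1d(image, kernel, True), kernel, False)


def tap_counts(radius):
    """Texture taps per output pixel: single 2D pass vs two 1D passes."""
    n = 2 * radius + 1
    return {"single_pass": n * n, "separable": 2 * n}
```

For a radius of 4, `tap_counts(4)` gives 81 taps for a single pass against 18 for the two separable passes, which is the kind of saving that keeps each sub-pass small.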
The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
happens — zero cost.
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
is never entered and the texture copy never happens — zero cost.
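
The bracketing described above can be sketched as a command-list builder. This is a schematic model, not the library's encoder: the command names are placeholders, and running each sub-pass in its own render pass is an assumption made for illustration.

```python
# Hypothetical sub-pass names, taken from the multi-pass description.
BACKDROP_SUB_PASSES = ("downsample", "blur_horizontal", "blur_vertical", "composite")


def encode_frame(main_draws, backdrop_draws):
    """Return an ordered list of placeholder command names for one frame.

    The backdrop bracket (pass break + render-target copy) is emitted at
    most once, no matter how many backdrop draws are queued.
    """
    commands = ["begin_render_pass"]
    commands += [f"draw:{name}" for name in main_draws]
    if backdrop_draws:
        commands.append("end_render_pass")     # leave the main pass
        commands.append("copy_render_target")  # paid once per frame
        for sub_pass in BACKDROP_SUB_PASSES:
            # Assumption: each sub-pass is its own render pass.
            commands.append(f"begin_render_pass:{sub_pass}")
            commands += [f"draw:{name}" for name in backdrop_draws]
            commands.append("end_render_pass")
        commands.append("begin_render_pass")   # resume normal drawing
    commands.append("end_render_pass")
    return commands
```

With an empty `backdrop_draws` list the output contains no pass break and no copy, matching the zero-cost case in the text.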
**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from
~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument
that justifies the main/effects split does not apply here. The main/effects split protects the
_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop
pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass
runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all.
Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical
UI scales), so the occupancy variation within the bracket has negligible impact on frame time.
### Vertex layout
@@ -524,19 +547,21 @@ The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex
```
Primitive :: struct {
kind: Shape_Kind, // 0: enum u8
flags: Shape_Flags, // 1: bit_set[Shape_Flag; u8]
_pad: u16, // 2: reserved
bounds: [4]f32, // 4: min_x, min_y, max_x, max_y
color: Color, // 20: u8x4
_pad2: [3]u8, // 24: alignment
params: Shape_Params, // 28: raw union, 32 bytes
bounds: [4]f32, // 0: min_x, min_y, max_x, max_y
color: Color, // 16: u8x4, unpacked in shader via unpackUnorm4x8
kind_flags: u32, // 20: (kind as u32) | (flags as u32 << 8)
rotation: f32, // 24: shader self-rotation in radians
_pad: f32, // 28: alignment
params: Shape_Params, // 32: raw union, 32 bytes (two vec4s of shape-specific data)
uv_rect: [4]f32, // 64: texture UV sub-region (u_min, v_min, u_max, v_max)
}
// Total: 60 bytes (padded to 64 for GPU alignment)
// Total: 80 bytes (std430 aligned)
```
`Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side. The
`uv_rect` field is used by textured SDF primitives (`Shape_Flag.Textured`); non-textured primitives
leave it zeroed.
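
The 80-byte layout and the `kind_flags` packing can be checked off-line with Python's `struct` module. This is a verification sketch rather than renderer code: the format string mirrors the field offsets in the updated struct comments, while the helper names and the sample kind and flag values are hypothetical.

```python
import struct

# bounds(4 x f32) | color(4 x u8) | kind_flags(u32) | rotation(f32) | _pad(f32)
# | params(32 raw bytes) | uv_rect(4 x f32); little-endian, no implicit padding.
PRIMITIVE_FORMAT = "<4f4BIff32s4f"


def pack_kind_flags(kind, flags):
    """(kind as u32) | (flags as u32 << 8), both limited to 8 bits."""
    if not (0 <= kind <= 0xFF and 0 <= flags <= 0xFF):
        raise ValueError("kind and flags must each fit in 8 bits")
    return kind | (flags << 8)


def unpack_kind_flags(kind_flags):
    """Inverse of pack_kind_flags, as a shader would decode it."""
    return kind_flags & 0xFF, (kind_flags >> 8) & 0xFF


def pack_primitive(bounds, color, kind, flags, rotation, params, uv_rect):
    """Serialize one primitive; params is up to 32 raw bytes (zero-padded)."""
    return struct.pack(PRIMITIVE_FORMAT, *bounds, *color,
                       pack_kind_flags(kind, flags), rotation, 0.0,
                       params, *uv_rect)


if __name__ == "__main__":
    blob = pack_primitive((0.0, 0.0, 64.0, 32.0), (255, 128, 0, 255),
                          kind=3, flags=0b101, rotation=0.0,
                          params=bytes(32), uv_rect=(0.0, 0.0, 1.0, 1.0))
    print(len(blob), struct.calcsize(PRIMITIVE_FORMAT))
    print(unpack_kind_flags(struct.unpack_from("<I", blob, 20)[0]))
```

The offsets used in `unpack_from` (20 for `kind_flags`, 64 for `uv_rect`) are the ones given in the struct comments.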
### Draw submission order
@@ -547,7 +572,7 @@ Within each scissor region, draws are issued in submission order to preserve the
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
backdrop primitives.
The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
@@ -647,7 +672,7 @@ register-pressure analysis from the pipeline-strategy section above shows this i
so both run at 100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF
evaluation) amounts to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline
would add ~1–5μs per pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side
savings. Within the main tier, unified remains strictly better.
savings. Within the main pipeline, unified remains strictly better.
The naming convention follows the existing shape API: `rectangle_texture` and
`rectangle_texture_corners` sit alongside `rectangle` and `rectangle_corners`, mirroring the
@@ -725,6 +750,35 @@ The following are plumbed in the descriptor but not implemented in phase 1:
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.
## Multi-window support
The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.
RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
bucket, so different windows can have independent clip and transform state without interference.
The key properties this gives RAD:
- **Per-window isolation.** Each window builds its own bucket with its own pass list and state stacks.
No global contention.
- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
build buckets concurrently, then submit them to the render backend sequentially.
- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
bucket with a transform applied, without rebuilding its draw calls.
For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
global (shared across windows), with only the per-frame staging buffers and layer/scissor state
becoming per-context.
## Building shaders
GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)