Basic texture support

2026-04-21 13:01:02 -07:00
parent f85187eff3
commit a4623a13b5
17 changed files with 1375 additions and 216 deletions
--- a/draw/README.md
+++ b/draw/README.md
@@ -47,99 +47,107 @@ primitives and effects can be added to the library without architectural changes

 ### Overview: three pipelines

-The 2D renderer will use three GPU pipelines, split by **register pressure compatibility** and
-**render-state requirements**:
+The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs. effects) and
+**render-pass structure** (backdrop vs. everything else):

-1. **Main pipeline** — shapes (SDF and tessellated) and text. Low register footprint (~18–22
-   registers per thread). Runs at high GPU occupancy. Handles 90%+ of all fragments in a typical
-   frame.
+1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Low register
+   footprint (~18–24 registers per thread). Runs at full GPU occupancy on every architecture.
+   Handles 90%+ of all fragments in a typical frame.

 2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Medium register footprint (~48–60 registers). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
-   redundant overdraw.
+   redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
+   low-end hardware (see register analysis below).

-3. **Backdrop-effects pipeline** — frosted glass, refraction, and any effect that samples the current
-   render target as input. High register footprint (~70–80 registers) and structurally requires a
-   `CopyGPUTextureToTexture` from the render target before drawing. Separated both for register
-   pressure and because the texture-copy requirement forces a render-pass-level state change.
+3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
+   target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
+   where each individual pass has a low-to-medium register footprint (~15–40 registers). Separated
+   from the other pipelines because it structurally requires ending the current render pass and
+   copying the render target before any backdrop-sampling fragment can execute — a command-buffer-
+   level boundary that cannot be avoided regardless of shader complexity.

 A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
 uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
-switches plus 1 texture copy. At ~5μs per pipeline bind on modern APIs, worst-case switching overhead
-is under 0.15% of an 8.3ms (120 FPS) frame budget.
+switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
+overhead is negligible relative to an 8.3ms (120 FPS) frame budget.

 ### Why three pipelines, not one or seven

 The natural question is whether we should use a single unified pipeline (fewer state changes, simpler
 code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register usage).

-The dominant cost factor is **GPU register pressure**, not pipeline switching overhead or fragment
-shader branching. A GPU shader core has a fixed register pool shared among all concurrent threads. The
-compiler allocates registers pessimistically based on the worst-case path through the shader. If the
-shader contains both a 20-register RRect SDF and a 72-register frosted-glass blur, _every_ fragment
-— even trivial RRects — is allocated 72 registers. This directly reduces **occupancy** (the number of
-warps that can run simultaneously), which reduces the GPU's ability to hide memory latency.
+#### Main/effects split: register pressure

-Concrete occupancy analysis on modern NVIDIA SMs, which have 65,536 32-bit registers and a
-hardware-imposed maximum thread count per SM that varies by architecture (Volta/A100: 2,048;
-consumer Ampere/Ada: 1,536). Occupancy is register-limited only when `65536 / regs_per_thread` falls
-below the hardware thread cap; above that cap, occupancy is 100% regardless of register count.
+A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
+allocates registers pessimistically based on the worst-case path through the shader. If the shader
+contains both a 20-register RRect SDF and a 48-register drop-shadow blur, _every_ fragment — even
+trivial RRects — is allocated 48 registers. This directly reduces **occupancy** (the number of
+warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
+latency.

-On consumer Ampere/Ada GPUs (RTX 30xx/40xx, max 1,536 threads per SM):
+Each GPU architecture has a **register cliff**, a per-thread register count above which occupancy
+starts dropping (on NVIDIA, 65,536 ÷ max threads per SM). Below it, extra registers cost no occupancy.

-| Register allocation       | Reg-limited threads | Actual (hw-capped) | Occupancy |
-| ------------------------- | ------------------- | ------------------ | --------- |
-| 20 regs (RRect only)      | 3,276               | 1,536              | 100%      |
-| 32 regs                   | 2,048               | 1,536              | 100%      |
-| 48 regs (+ drop shadow)   | 1,365               | 1,365              | ~89%      |
-| 72 regs (+ frosted glass) | 910                 | 910                | ~59%      |
+On consumer Ampere/Ada GPUs (RTX 30xx/40xx, 65,536 regs/SM, max 1,536 threads/SM, cliff at ~43 regs):

-On Volta/A100 GPUs (max 2,048 threads per SM):
+| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
+| ----------------------- | ------------------- | ------------------ | --------- |
+| 20 regs (main pipeline) | 3,276               | 1,536              | 100%      |
+| 32 regs                 | 2,048               | 1,536              | 100%      |
+| 48 regs (effects)       | 1,365               | 1,365              | ~89%      |

-| Register allocation       | Reg-limited threads | Actual (hw-capped) | Occupancy |
-| ------------------------- | ------------------- | ------------------ | --------- |
-| 20 regs (RRect only)      | 3,276               | 2,048              | 100%      |
-| 32 regs                   | 2,048               | 2,048              | 100%      |
-| 48 regs (+ drop shadow)   | 1,365               | 1,365              | ~67%      |
-| 72 regs (+ frosted glass) | 910                 | 910                | ~44%      |
+On Volta/A100 GPUs (65,536 regs/SM, max 2,048 threads/SM, cliff at ~32 regs):

-The register cliff — where occupancy begins dropping — starts at ~43 regs/thread on consumer
-Ampere/Ada (65536 / 1536) and ~32 regs/thread on Volta/A100 (65536 / 2048). Below the cliff,
-adding registers has zero occupancy cost.
+| Register allocation     | Reg-limited threads | Actual (hw-capped) | Occupancy |
+| ----------------------- | ------------------- | ------------------ | --------- |
+| 20 regs (main pipeline) | 3,276               | 2,048              | 100%      |
+| 32 regs                 | 2,048               | 2,048              | 100%      |
+| 48 regs (effects)       | 1,365               | 1,365              | ~67%      |

-The impact of reduced occupancy depends on whether the shader is memory-latency-bound (where
-occupancy is critical for hiding latency) or ALU-bound (where it matters less). For the
-backdrop-effects pipeline's frosted-glass shader, which performs multiple dependent texture reads,
-59% occupancy (consumer) or 44% occupancy (Volta) meaningfully reduces the GPU's ability to hide
-texture latency — roughly a 1.7× to 2.3× throughput reduction compared to full occupancy. At 4K with
-1.5× overdraw (~12.4M fragments), if the main pipeline's fragment work at full occupancy takes ~2ms,
-a single unified shader containing the glass branch would push it to ~3.4–4.6ms depending on
-architecture. This is a per-frame multiplier, not a per-primitive cost — it applies even when the
-heavy branch is never taken, because the compiler allocates registers for the worst-case path.
+On low-end mobile (ARM Mali Bifrost/Valhall, 64 regs/thread, cliff fixed at 32 regs):

-**Note on Apple M3+ GPUs:** Apple's M3 GPU architecture introduces Dynamic Caching (register file
-virtualization), which allocates registers dynamically at runtime based on actual usage rather than
-worst-case declared usage. This significantly reduces the static register-pressure-to-occupancy
-penalty described above. The tier split remains useful on Apple hardware for other reasons (keeping
-the backdrop texture-copy out of the main render pass, isolating blur ALU complexity), but the
-register-pressure argument specifically weakens on M3 and later.
+| Register allocation  | Occupancy                  |
+| -------------------- | -------------------------- |
+| 0–32 regs (main)     | 100% (full thread count)   |
+| 33–64 regs (effects) | ~50% (thread count halves) |

-The three-pipeline split groups primitives by register footprint so that:
+Mali's cliff at 32 registers is the binding constraint. On consumer desktop GPUs, going from 20 to 48
+registers costs little occupancy (100% to ~89%; ~67% on Volta/A100); on Mali the thread count halves,
+which for latency-bound shaders approaches a 2× throughput reduction. The main/effects split protects
+90%+ of a frame's fragments (shapes, text, textures) from the effects pipeline's register cost.

- Main pipeline (~20 regs): all fragments run at full occupancy on every architecture.
- Effects pipeline (~48–55 regs): shadow/glow fragments run at 67–89% occupancy depending on
-  architecture; unavoidable given the blur math complexity.
- Backdrop-effects pipeline (~72–75 regs): glass fragments run at 44–59% occupancy; also
-  unavoidable, and structurally separated anyway by the texture-copy requirement.
+For the effects pipeline's drop-shadow shader (erf-approximation blur math with several texture
+fetches), 50% occupancy on Mali roughly halves throughput. At 4K with 1.5× overdraw (~12.4M fragments),
+if the main pipeline's fragment work takes ~2ms at full occupancy, a unified shader containing the
+shadow branch would push it to ~4ms on low-end mobile. This multiplier applies every frame, even when
+the heavy branch is never taken, because the compiler allocates registers for the worst-case path.

-This avoids the register-pressure tax of a single unified shader while keeping pipeline count minimal
-(3 vs. Zed GPUI's 7). The effects that drag occupancy down are isolated to the fragments that
-actually need them. Crucially, all shape kinds within the main pipeline (SDF, tessellated, text)
-cluster at 12–24 registers — well below the register cliff on every architecture — so unifying them
-costs nothing in occupancy.
+All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
+12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
+occupancy.

-**Why not per-primitive-type pipelines (GPUI's approach)?** Zed's GPUI uses 7 separate shader pairs:
+**Note on Apple M3+ GPUs:** Apple's M3 introduces Dynamic Caching (register file virtualization),
+which allocates registers at runtime based on actual usage rather than worst-case. This weakens the
+static register-pressure argument on M3 and later, but the split remains useful for isolating blur
+ALU complexity and keeping the backdrop texture-copy out of the main render pass.
+
+#### Backdrop split: render-pass structure
+
+The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
+reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the current
+render target must be copied to a separate texture via `CopyGPUTextureToTexture` — a command-buffer-
+level operation that requires ending the current render pass. This boundary exists regardless of
+shader complexity and cannot be optimized away.
+
+The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
+register-light (~15–40 regs each), so merging them into the effects pipeline would cause no occupancy
+problem. But the render-pass boundary makes merging structurally impossible: effects draws happen
+inside the main render pass, while backdrop draws happen inside their own bracketed pass sequence.
+
+#### Why not per-primitive-type pipelines (GPUI's approach)
+
+Zed's GPUI uses 7 separate shader pairs:
 quad, shadow, underline, monochrome sprite, polychrome sprite, path, surface. This eliminates all
 branching and gives each shader minimal register usage. Three concrete costs make this approach wrong
 for our use case:
@@ -151,7 +159,7 @@ typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kin
 ~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
 pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
 375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
-compensating GPU-side benefit, because the register-pressure savings within the simple-SDF tier are
+compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
 negligible (all members cluster at 12–22 registers).

 **Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
@@ -190,8 +198,8 @@ in submission order:
  ~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs — still 3.5×
  cheaper than the pipeline-switching alternative.

-The split we _do_ perform (main / effects / backdrop-effects) is motivated by register-pressure tier
-boundaries where occupancy drops are significant at 4K (see numbers above). Within a tier, unified is
+The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
+and structural render-pass requirements (see analysis above). Within a pipeline, unified is
 strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
 negligible GPU-side branching cost.

@@ -483,25 +491,40 @@ Wallace's variant) and vger-rs.
 - Vello's implementation of blurred rounded rectangle as a gradient type:
  https://github.com/linebender/vello/pull/665

-### Backdrop-effects pipeline
+### Backdrop pipeline

-The backdrop-effects pipeline handles effects that sample the current render target as input: frosted
-glass, refraction, mirror surfaces. It is structurally separated from the effects pipeline for two
-reasons:
+The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
+refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
+register pressure.

-1. **Render-state requirement.** Before any backdrop-sampling fragment can run, the current render
-   target must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-
-   buffer-level operation that cannot happen mid-render-pass. The copy naturally creates a pipeline
-   boundary.
+**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
+must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
+operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
+amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
+while also writing to it.

-2. **Register pressure.** Backdrop-sampling shaders read from a texture with Gaussian kernel weights
-   (multiple texture fetches per fragment), pushing register usage to ~70–80. Including this in the
-   effects pipeline would reduce occupancy for all shadow/glow fragments from ~30% to ~20%, costing
-   measurable throughput on the common case.
+**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
+(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
+iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
+pass has a low-to-medium register footprint (~15–40 registers), below the effects pipeline's
+~48–60. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass
+Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including
+Mali-G31 and VideoCore VI) where per-thread register limits are tight.

-The backdrop-effects pipeline binds a secondary sampler pointing at the captured backdrop texture. When
-no backdrop effects are present in a frame, this pipeline is never bound and the texture copy never
-happens — zero cost.
+**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
+buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
+resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
+regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
+is never entered and the texture copy never happens — zero cost.
+
+**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from
+~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument
+that justifies the main/effects split does not apply here. The main/effects split protects the
+_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop
+pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass
+runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all.
+Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical
+UI scales), so the occupancy variation within the bracket has negligible impact on frame time.

 ### Vertex layout

@@ -524,19 +547,21 @@ The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex

 ```
 Primitive :: struct {
-    kind:   Shape_Kind,     //  0: enum u8
-    flags:  Shape_Flags,    //  1: bit_set[Shape_Flag; u8]
-    _pad:   u16,            //  2: reserved
-    bounds: [4]f32,         //  4: min_x, min_y, max_x, max_y
-    color:  Color,          // 20: u8x4
-    _pad2:  [3]u8,          // 24: alignment
-    params: Shape_Params,   // 28: raw union, 32 bytes
+    bounds:     [4]f32,         //  0: min_x, min_y, max_x, max_y
+    color:      Color,          // 16: u8x4, unpacked in shader via unpackUnorm4x8
+    kind_flags: u32,            // 20: (kind as u32) | (flags as u32 << 8)
+    rotation:   f32,            // 24: shader self-rotation in radians
+    _pad:       f32,            // 28: alignment
+    params:     Shape_Params,   // 32: raw union, 32 bytes (two vec4s of shape-specific data)
+    uv_rect:    [4]f32,         // 64: texture UV sub-region (u_min, v_min, u_max, v_max)
 }
-// Total: 60 bytes (padded to 64 for GPU alignment)
+// Total: 80 bytes (std430 aligned)
 ```

 `Shape_Params` is a `#raw_union` with named variants per shape kind (`rrect`, `circle`, `segment`,
-etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side.
+etc.), ensuring type safety on the CPU side and zero-cost reinterpretation on the GPU side. The
+`uv_rect` field is used by textured SDF primitives (`Shape_Flag.Textured`); non-textured primitives
+leave it zeroed.

 ### Draw submission order

@@ -547,7 +572,7 @@ Within each scissor region, draws are issued in submission order to preserve the
 2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
   shapes, indexed for text). Pipeline state unchanged from today.
 3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
-4. If backdrop effects are present: copy render target, bind **backdrop-effects pipeline** → draw
+4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
   backdrop primitives.

 The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
@@ -647,7 +672,7 @@ register-pressure analysis from the pipeline-strategy section above shows this i
 so both run at 100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF
 evaluation) amounts to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline
 would add ~1–5μs per pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side
-savings. Within the main tier, unified remains strictly better.
+savings. Within the main pipeline, unified remains strictly better.

 The naming convention follows the existing shape API: `rectangle_texture` and
 `rectangle_texture_corners` sit alongside `rectangle` and `rectangle_corners`, mirroring the
@@ -725,6 +750,35 @@ The following are plumbed in the descriptor but not implemented in phase 1:
 expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
 sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

+## Multi-window support
+
+The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
+deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
+(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.
+
+RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
+that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
+it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
+the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
+per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
+(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
+bucket, so different windows can have independent clip and transform state without interference.
+
+The key properties this gives RAD:
+
+- **Per-window isolation.** Each window builds its own bucket with its own pass list and state stacks.
+  No global contention.
+- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
+  build buckets concurrently, then submit them to the render backend sequentially.
+- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
+  bucket with a transform applied, without rebuilding its draw calls.
+
+For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
+context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
+style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
+global (shared across windows), with only the per-frame staging buffers and layer/scissor state
+becoming per-context.
+
 ## Building shaders

 GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)