Added backdrop effects pipeline (blur)

2026-04-28 22:12:25 -07:00
parent ff29dbd92f
commit 16989cbb71
29 changed files with 2931 additions and 415 deletions
@@ -52,9 +52,12 @@ statically allocates registers for the worst-case path (Ring_Arc) regardless of
 fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
 a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).

-MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
-benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
-glyph edges and tessellated user geometry if desired.
+MSAA is intentionally not supported. SDF text and shapes compute fragment coverage analytically
+via `smoothstep`, so they don't benefit from multisampling. Tessellated user geometry submitted via
+`prepare_shape` is rendered without anti-aliasing — if AA is required for tessellated content, the
+caller must render it to their own offscreen target and submit the result as a texture. This
+decision matches RAD Debugger's architecture and aligns with the SBC target (Mali Valhall, where
+MSAA's per-tile bandwidth multiplier is expensive).

 ## 2D rendering pipeline plan

@@ -249,9 +252,9 @@ API where each layer draws shadows before quads before glyphs. Our design avoids
 submission order is draw order, no layer juggling required.

 **PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
-first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
-variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
-additional axis with 7 pipelines vs 3.
+first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (blend
+modes, color formats) multiplies combinatorially: every axis multiplies the variant count of each
+pipeline, so the variant matrix stays 2.3× larger with 7 pipelines than with 3.
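
The arithmetic behind these figures can be checked with a short sketch. The 25 ms per-pipeline average is an assumption inferred from the quoted totals (175 / 7 = 75 / 3 = 25), and the axis sizes are made-up examples, not values from the renderer.

```python
# Illustrative only: AVG_COMPILE_MS is inferred from the totals quoted above,
# and the axis sizes passed below are hypothetical.
AVG_COMPILE_MS = 25

def cold_start_ms(pipelines, axis_sizes=()):
    """Total compile time when every pipeline is built for every axis combination."""
    variants = pipelines
    for size in axis_sizes:
        variants *= size
    return variants * AVG_COMPILE_MS

print(cold_start_ms(7), cold_start_ms(3))                  # 175 75
print(cold_start_ms(7, (2, 2)), cold_start_ms(3, (2, 2)))  # 700 300
print(round(7 / 3, 1))                                     # 2.3
```

Under this model the ratio between the two designs stays at 7/3 however many axes are added, while the absolute gap grows with every axis.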

 **Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
 the strongest candidate for per-kind splitting because effect branches are heavier than shape
@@ -587,27 +590,29 @@ Wallace's variant) and vger-rs.
 ### Backdrop pipeline

 The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
-refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
-register pressure.
+refraction, mirror surfaces. It is separated from the main and effects pipelines for a structural
+reason, not register pressure.

 **Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
-must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
-operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
-amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
-while also writing to it.
+must be in a sampler-readable state. A draw call cannot sample the render target it is also writing
+to; this is a hard GPU constraint, and the only way to satisfy it is to end the current render pass
+and start a new one. That render-pass boundary is what a “bracket” is.

 **Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
-(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
+(downsample → horizontal blur → vertical-blur+composite), following the standard approach used by
 iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
 sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
 multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
 require, keeping each sub-pass well under the 32-register cliff.
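
As a CPU-side illustration of why the blur can be split into separate passes, the sketch below checks that a horizontal pass followed by a vertical pass matches a single-pass 2D Gaussian. It is not the shader code: the ±3σ kernel radius, the clamped edge handling, and all function names are assumptions made for the example, and it says nothing about register counts.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights covering about +/- 3 sigma (an assumed cutoff)."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def clamp(value, low, high):
    return max(low, min(value, high))

def blur_rows(image, kernel):
    """One 1D pass along x, clamping reads at the image edges."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(w * row[clamp(x + k - radius, 0, width - 1)] for k, w in enumerate(kernel))
         for x in range(width)]
        for row in image
    ]

def transpose(image):
    return [list(column) for column in zip(*image)]

def separable_blur(image, sigma):
    """Horizontal pass, then vertical pass (run as a row pass on the transpose)."""
    kernel = gaussian_kernel(sigma)
    horizontal = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))

def single_pass_blur(image, sigma):
    """Reference 2D convolution: (2r + 1)^2 taps per pixel instead of 2 * (2r + 1)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    height, width = len(image), len(image[0])
    return [
        [sum(wy * wx * image[clamp(y + j - radius, 0, height - 1)][clamp(x + i - radius, 0, width - 1)]
             for j, wy in enumerate(kernel) for i, wx in enumerate(kernel))
         for x in range(width)]
        for y in range(height)
    ]

# A single bright pixel in a 9x9 image.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0

a = separable_blur(image, 1.0)
b = single_pass_blur(image, 1.0)
max_diff = max(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))
taps = len(gaussian_kernel(1.0))
print(max_diff < 1e-12, 2 * taps, taps * taps)  # True 14 49
```

The two results agree to floating-point precision, while the per-pixel tap count drops from quadratic to linear in the kernel width.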

-**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
-buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
-resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
-regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
-is never entered and the texture copy never happens — zero cost.
+**Approach B: render-target choice.** When any layer in the frame contains a backdrop draw, the
+entire frame renders into `source_texture` (a full-resolution single-sample texture owned by the
+backdrop pipeline) instead of directly into the swapchain. At the end of the frame, `source_texture`
+is copied to the swapchain via a single `CopyGPUTextureToTexture` call. This means the bracket has
+no mid-frame texture copy: by the time the bracket runs, `source_texture` already contains the
+pre-bracket frame contents and is the natural sampler input. When no layer in the frame has a
+backdrop draw, the existing fast path runs: the frame renders directly to the swapchain and the
+backdrop pipeline's working textures are never touched. Zero cost for backdrop-free frames.
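
The per-frame decision described here can be sketched as a small planning function. `Layer`, `frame_has_backdrop`, and `plan_frame` are hypothetical names used only for illustration; the real renderer's data structures are not shown in this section.

```python
from dataclasses import dataclass, field

@dataclass
class Layer:
    # Hypothetical stand-in: one entry per sub-batch, e.g. "SDF", "Text", "Backdrop".
    sub_batch_kinds: list = field(default_factory=list)

def frame_has_backdrop(layers):
    """True if any layer in the frame contains at least one backdrop sub-batch."""
    return any(kind == "Backdrop" for layer in layers for kind in layer.sub_batch_kinds)

def plan_frame(layers):
    """Return (render_target, end_of_frame_copies) for one frame."""
    if frame_has_backdrop(layers):
        # Whole frame goes to source_texture; one copy to the swapchain at the end.
        return "source_texture", 1
    # Fast path: render straight to the swapchain, no copy, no working textures.
    return "swapchain", 0

print(plan_frame([Layer(["SDF", "Text"])]))                       # ('swapchain', 0)
print(plan_frame([Layer(["SDF"]), Layer(["Backdrop", "Text"])]))  # ('source_texture', 1)
```

Note that the choice is made once per frame: a single backdrop in any layer redirects every layer of that frame to `source_texture`.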

 **Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
 registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
@@ -617,6 +622,75 @@ all. Additionally, backdrop effects cover a small fraction of the frame's total
 typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
 bracket would have negligible impact on frame time.

+#### Bracket scheduling model
+
+The bracket is scheduled per layer, anchored at the first backdrop sub-batch in the layer's
+submission order. Concretely, a layer with one or more backdrops splits into three groups:
+
+1. **Pass A (pre-bracket)** — every non-backdrop sub-batch with index `< bracket_start_index`.
+   Renders to `source_texture` in a single render pass.
+2. **The bracket** — every backdrop sub-batch in the layer (regardless of index). Runs one
+   downsample pass, then one (H-blur + V-composite) pass pair per unique sigma.
+3. **Pass B (post-bracket)** — every non-backdrop sub-batch with index `>= bracket_start_index`.
+   Renders to `source_texture` with `LOAD`, drawing on top of the composited backdrop output.
+
+`bracket_start_index` is the absolute index of the first `.Backdrop` kind in the layer's sub-batch
+range. If the layer has no backdrops, none of this kicks in and the layer renders in a single render
+pass via the existing fast path.
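
The three-way split above can be sketched as an index partition over one layer's sub-batch kinds. This is a minimal model under the rules just stated; `schedule_layer` and the string kinds are illustrative names, not the renderer's API.

```python
def schedule_layer(kinds):
    """Partition one layer's sub-batch indices into Pass A, the bracket, and Pass B.

    kinds: sub-batch kinds in submission order, e.g. ["SDF", "Backdrop", "Text"].
    """
    backdrops = [i for i, kind in enumerate(kinds) if kind == "Backdrop"]
    if not backdrops:
        # No backdrop in the layer: single render pass via the fast path.
        return {"fast_path": list(range(len(kinds)))}
    bracket_start_index = backdrops[0]
    return {
        # Everything before the first backdrop is non-backdrop by definition.
        "pass_a": list(range(bracket_start_index)),
        # Every backdrop in the layer, regardless of index.
        "bracket": backdrops,
        # Non-backdrop sub-batches at or after the first backdrop.
        "pass_b": [i for i, kind in enumerate(kinds)
                   if i >= bracket_start_index and kind != "Backdrop"],
    }

print(schedule_layer(["SDF", "Backdrop", "Text", "Backdrop", "SDF"]))
# {'pass_a': [0], 'bracket': [1, 3], 'pass_b': [2, 4]}
print(schedule_layer(["SDF", "Text"]))
# {'fast_path': [0, 1]}
```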
+
+The downsample runs once per layer, not once per sigma: it just copies `source_texture` to a
+¼-resolution working texture and doesn't depend on the kernel. Each unique sigma in the layer
+triggers one H-blur (reads `downsample_texture`, writes `h_blur_texture`) and one V-composite
+(reads `h_blur_texture`, writes `source_texture` per-primitive with the SDF mask). Sub-batch
+coalescing in `append_or_extend_sub_batch` merges contiguous same-sigma backdrops into a single
+instanced V-composite draw call; non-contiguous same-sigma backdrops still share the H-blur output
+but issue separate V-composite draws.
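
The pass and draw-call counts implied by this paragraph can be modeled as follows. This is a sketch of the counting rules only, not of `append_or_extend_sub_batch` itself; the function names and the `(kind, sigma)` tuples are assumptions for illustration.

```python
def v_composite_draws(sub_batches):
    """Merge contiguous same-sigma backdrops into one instanced draw.

    sub_batches: (kind, sigma) tuples in submission order; sigma is None for
    non-backdrop kinds. Returns a list of (sigma, instance_count) draws.
    """
    draws = []
    previous = None
    for kind, sigma in sub_batches:
        if kind == "Backdrop":
            if previous == ("Backdrop", sigma):
                draws[-1][1] += 1          # extend the current instanced draw
            else:
                draws.append([sigma, 1])   # start a new draw
        previous = (kind, sigma)
    return [tuple(draw) for draw in draws]

def bracket_render_passes(sub_batches):
    """One downsample per layer, plus one H-blur and one V-composite per unique sigma."""
    sigmas = {sigma for kind, sigma in sub_batches if kind == "Backdrop"}
    return 1 + 2 * len(sigmas) if sigmas else 0

layer = [("SDF", None), ("Backdrop", 12), ("Backdrop", 12),
         ("SDF", None), ("Backdrop", 12), ("Backdrop", 4)]
print(v_composite_draws(layer))      # [(12, 2), (12, 1), (4, 1)]
print(bracket_render_passes(layer))  # 5
```

In this example the third sigma-12 backdrop is separated from the first two by an SDF sub-batch, so it gets its own draw but still reuses the sigma-12 H-blur output.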
+
+#### Submission-order trade-off
+
+Within Pass A and Pass B, sub-batches render in the user's submission order. What the bracket model
+sacrifices is *interleaved* ordering between backdrop and non-backdrop content within a single
+layer. A non-backdrop sub-batch submitted between two backdrops still renders in Pass B (after the
+bracket), not at its submission position. Worked example:
+
+```
+draw.rectangle(layer, bg, GRAY)              // 0  Tessellated  → Pass A
+draw.rectangle(layer, card_blue, BLUE)       // 1  SDF          → Pass A
+draw.rectangle_backdrop(layer, panelA, 12)   // 2  Backdrop     → Bracket  (sees: bg + blue card)
+draw.rectangle(layer, card_red, RED)         // 3  SDF          → Pass B   (drawn ON TOP of panelA)
+draw.rectangle_backdrop(layer, panelB, 12)   // 4  Backdrop     → Bracket  (sees: bg + blue card; same as panelA)
+draw.text(layer, "label", ...)               // 5  Text         → Pass B   (drawn ON TOP of both panels)
+```
+
+In this layer, panelB does *not* see card_red — even though card_red was submitted before panelB —
+because both backdrops sample `source_texture` as it stood at the bracket entry, which is after
+Pass A and before card_red has rendered. card_red ends up on top of panelA, not underneath it.
+
+The user controls the alternative outcome by splitting layers. Putting card_red and panelB into a
+new layer (via `draw.new_layer`) gives panelB a fresh source snapshot that includes panelA and
+card_red:
+
+```
+base := draw.begin(...)
+draw.rectangle(base, bg, GRAY)
+draw.rectangle(base, card_blue, BLUE)
+draw.rectangle_backdrop(base, panelA, 12)    // panelA in base layer's bracket
+
+top := draw.new_layer(base, ...)
+draw.rectangle(top, card_red, RED)
+draw.rectangle_backdrop(top, panelB, 12)     // top layer's bracket; sees base + card_red
+draw.text(top, "label", ...)
+```
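
Both examples can be checked with a small model of what each backdrop samples. The sketch assumes layers render in order into the same `source_texture` and that every backdrop in a layer samples the texture as it stands at that layer's bracket entry; `backdrop_inputs` is a hypothetical helper, not part of the `draw` API.

```python
def backdrop_inputs(layers):
    """Map each backdrop name to the draws already in source_texture when it samples.

    layers: list of layers, each a list of (name, kind) in submission order.
    """
    rendered = []
    seen = {}
    for draws in layers:
        first = next((i for i, (_, kind) in enumerate(draws) if kind == "Backdrop"), len(draws))
        pass_a = [name for name, _ in draws[:first]]
        backdrops = [name for name, kind in draws if kind == "Backdrop"]
        pass_b = [name for name, kind in draws[first:] if kind != "Backdrop"]
        rendered += pass_a                 # Pass A lands in source_texture first
        for name in backdrops:
            seen[name] = list(rendered)    # snapshot at bracket entry
        rendered += backdrops + pass_b     # composited backdrops, then Pass B on top
    return seen

single = [[("bg", "Tessellated"), ("card_blue", "SDF"), ("panelA", "Backdrop"),
           ("card_red", "SDF"), ("panelB", "Backdrop"), ("label", "Text")]]
split = [[("bg", "Tessellated"), ("card_blue", "SDF"), ("panelA", "Backdrop")],
         [("card_red", "SDF"), ("panelB", "Backdrop"), ("label", "Text")]]

print(backdrop_inputs(single)["panelB"])  # ['bg', 'card_blue']
print(backdrop_inputs(split)["panelB"])   # ['bg', 'card_blue', 'panelA', 'card_red']
```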
+
+Why one bracket per layer and not one per backdrop? Each bracket adds at least three render passes
+(one downsample, plus an H-blur and a V-composite per unique sigma) and at least three tile-cache
+flushes on tilers like Mali Valhall. Strict submission-order semantics would require one bracket
+per cluster of contiguous backdrops, which scales the GPU cost linearly with how interleaved the
+user's submission happens to be, a footgun. The current design caps the bracket cost per layer
+regardless of submission interleave, and gives the user explicit control over ordering through the
+existing layer abstraction. This matches the cost/complexity envelope of iOS `UIVisualEffectView`
+and CSS `backdrop-filter` (both of which constrain backdrop ordering implicitly).
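
To make the scaling argument concrete, the sketch below compares render-pass counts under the per-layer model and under a hypothetical strict-order model with one bracket per cluster of contiguous backdrops. It assumes a single sigma, so each bracket costs three passes; the function names are illustrative.

```python
PASSES_PER_BRACKET = 3  # downsample + H-blur + V-composite, single-sigma assumption

def backdrop_clusters(kinds):
    """Count runs of contiguous "Backdrop" entries in submission order."""
    clusters = 0
    previous_was_backdrop = False
    for kind in kinds:
        is_backdrop = kind == "Backdrop"
        if is_backdrop and not previous_was_backdrop:
            clusters += 1
        previous_was_backdrop = is_backdrop
    return clusters

def passes_per_layer_model(kinds):
    """Current design: at most one bracket per layer."""
    return PASSES_PER_BRACKET if "Backdrop" in kinds else 0

def passes_per_cluster_model(kinds):
    """Hypothetical strict-order design: one bracket per contiguous cluster."""
    return PASSES_PER_BRACKET * backdrop_clusters(kinds)

interleaved = ["SDF", "Backdrop", "SDF", "Backdrop", "SDF", "Backdrop"]
grouped = ["SDF", "SDF", "SDF", "Backdrop", "Backdrop", "Backdrop"]
print(passes_per_layer_model(interleaved), passes_per_cluster_model(interleaved))  # 3 9
print(passes_per_layer_model(grouped), passes_per_cluster_model(grouped))          # 3 3
```

The same six draws cost three times as many bracket passes under the strict-order model purely because of how they were interleaved, which is the cost the per-layer design avoids.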
+
 ### Vertex layout

 The vertex struct is unchanged from the current 20-byte layout: