Added backdrop effects pipeline (blur)
+92 -18
@@ -52,9 +52,12 @@ statically allocates registers for the worst-case path (Ring_Arc) regardless of
 fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
 a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
 
-MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
-benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
-glyph edges and tessellated user geometry if desired.
+MSAA is intentionally not supported. SDF text and shapes compute fragment coverage analytically
+via `smoothstep`, so they don't benefit from multisampling. Tessellated user geometry submitted via
+`prepare_shape` is rendered without anti-aliasing — if AA is required for tessellated content, the
+caller must render it to their own offscreen target and submit the result as a texture. This
+decision matches RAD Debugger's architecture and aligns with the SBC target (Mali Valhall, where
+MSAA's per-tile bandwidth multiplier is expensive).
 
 ## 2D rendering pipeline plan
 
@@ -249,9 +252,9 @@ API where each layer draws shadows before quads before glyphs. Our design avoids
 submission order is draw order, no layer juggling required.
 
 **PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
-first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
-variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
-additional axis with 7 pipelines vs 3.
+first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (blend
+modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per additional
+axis with 7 pipelines vs 3.
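The cold-start figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes a representative compile time of 25 ms per pipeline (an assumption chosen to match the ~175 ms and ~75 ms totals; the text only gives a 1–50 ms range) and a hypothetical pair of state axes with two variants each. `cold_start_ms` is an illustrative helper, not part of the library.

```python
# Assumed representative per-pipeline compile time; the text quotes 1-50 ms.
COMPILE_MS = 25


def cold_start_ms(pipelines: int, variants_per_axis: tuple = ()) -> int:
    """Total first-use compile cost when every pipeline is compiled once
    per combination of state-axis variants."""
    combos = 1
    for n in variants_per_axis:
        combos *= n
    return pipelines * combos * COMPILE_MS


print(cold_start_ms(7))          # 175
print(cold_start_ms(3))          # 75
# Two hypothetical axes: 2 blend modes x 2 color formats.
print(cold_start_ms(7, (2, 2)))  # 700
print(cold_start_ms(3, (2, 2)))  # 300
print(round(7 / 3, 1))           # 2.3, the ratio holds for any number of axes
```

The ratio between the two designs stays at 7/3 however many axes are added; what grows is the absolute gap (100 ms with no extra axes, 400 ms with the two assumed above).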
 
 **Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
 the strongest candidate for per-kind splitting because effect branches are heavier than shape
@@ -587,27 +590,29 @@ Wallace's variant) and vger-rs.
 ### Backdrop pipeline
 
 The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
-refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
-register pressure.
+refraction, mirror surfaces. It is separated from the main and effects pipelines for a structural
+reason, not register pressure.
 
 **Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
-must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
-operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
-amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
-while also writing to it.
+must be in a sampler-readable state. A draw call cannot sample the render target it is also
+writing to; this is a hard GPU constraint, and the only way to satisfy it is to end the current
+render pass and start a new one. That render-pass boundary is what a “bracket” is.
 
 **Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
-(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
+(downsample → horizontal blur → vertical-blur+composite), following the standard approach used by
 iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
 sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
 multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
 require, keeping each sub-pass well under the 32-register cliff.
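To illustrate why the blur is split into a horizontal and a vertical pass, here is a minimal CPU-side sketch of a separable Gaussian blur. It is a model of the math only, not the shader code: the function names, the clamp-to-edge addressing, and the tiny test image are all assumptions, and the downsample and SDF-masked composite steps are left out. It checks that two 1D passes give the same result as a single 2D pass while taking far fewer samples per pixel. It says nothing about register counts, which depend on the shader compiler.

```python
import math


def gaussian_kernel(sigma: float, radius: int) -> list:
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    w = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]


def clamp(v: int, hi: int) -> int:
    """Clamp-to-edge addressing (an assumption about the sampler mode)."""
    return min(max(v, 0), hi)


def blur_1d(img, kernel, horizontal: bool):
    """One blur sub-pass along a single axis."""
    h, w, r = len(img), len(img[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for k, weight in enumerate(kernel):
                sx = clamp(x + k - r, w - 1) if horizontal else x
                sy = y if horizontal else clamp(y + k - r, h - 1)
                out[y][x] += weight * img[sy][sx]
    return out


def blur_2d(img, kernel):
    """Single-pass reference using the full 2D kernel (outer product)."""
    h, w, r = len(img), len(img[0]), len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ky, wy in enumerate(kernel):
                for kx, wx in enumerate(kernel):
                    out[y][x] += wy * wx * img[clamp(y + ky - r, h - 1)][clamp(x + kx - r, w - 1)]
    return out


image = [[float((x * 7 + y * 3) % 11) for x in range(6)] for y in range(5)]
kernel = gaussian_kernel(sigma=1.5, radius=3)
separable = blur_1d(blur_1d(image, kernel, horizontal=True), kernel, horizontal=False)
reference = blur_2d(image, kernel)
max_err = max(abs(a - b) for ra, rb in zip(separable, reference) for a, b in zip(ra, rb))
taps_separable = 2 * len(kernel)  # samples per pixel across both passes
taps_single = len(kernel) ** 2    # samples per pixel in one monolithic pass
print(max_err < 1e-9, taps_separable, taps_single)  # True 14 49
```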
 
-**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
-buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
-resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
-regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
-is never entered and the texture copy never happens — zero cost.
+**Approach B: render-target choice.** When any layer in the frame contains a backdrop draw, the
+entire frame renders into `source_texture` (a full-resolution single-sample texture owned by the
+backdrop pipeline) instead of directly into the swapchain. At the end of the frame, `source_texture`
+is copied to the swapchain via a single `CopyGPUTextureToTexture` call. This means the bracket has
+no mid-frame texture copy: by the time the bracket runs, `source_texture` already contains the pre-
+bracket frame contents and is the natural sampler input. When no layer in the frame has a backdrop
+draw, the existing fast path runs: the frame renders directly to the swapchain and the backdrop
+pipeline's working textures are never touched. Zero cost for backdrop-free frames.
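The per-frame decision described here fits in a few lines. The sketch below is a hypothetical model: `choose_render_target` and the string labels are made up for illustration, and a frame is represented simply as a list of layers, each a list of sub-batch kind names.

```python
def choose_render_target(layer_kinds: list) -> tuple:
    """Return (render target, needs end-of-frame copy to the swapchain).

    layer_kinds: one list of sub-batch kind names per layer (hypothetical model).
    """
    has_backdrop = any("Backdrop" in kinds for kinds in layer_kinds)
    if has_backdrop:
        # Whole frame goes to source_texture, so brackets can sample it in place.
        return "source_texture", True
    # Fast path: backdrop working textures are never touched.
    return "swapchain", False


print(choose_render_target([["SDF", "Text"], ["Tessellated"]]))  # ('swapchain', False)
print(choose_render_target([["SDF"], ["SDF", "Backdrop"]]))      # ('source_texture', True)
```

The choice is made per frame, not per layer: a single backdrop in any layer redirects every layer's rendering.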
 
 **Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
 registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
@@ -617,6 +622,75 @@ all. Additionally, backdrop effects cover a small fraction of the frame's total
 typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
 bracket would have negligible impact on frame time.
 
+#### Bracket scheduling model
+
+The bracket is scheduled per layer, anchored at the first backdrop sub-batch in the layer's
+submission order. Concretely, a layer with one or more backdrops splits into three groups:
+
+1. **Pass A (pre-bracket)** — every non-backdrop sub-batch with index `< bracket_start_index`.
+Renders to `source_texture` in a single render pass.
+2. **The bracket** — every backdrop sub-batch in the layer (regardless of index). Runs one
+downsample pass, then one (H-blur + V-composite) pass pair per unique sigma.
+3. **Pass B (post-bracket)** — every non-backdrop sub-batch with index `>= bracket_start_index`.
+Renders to `source_texture` with `LOAD`, drawing on top of the composited backdrop output.
+
+`bracket_start_index` is the absolute index of the first `.Backdrop` kind in the layer's sub-batch
+range. If the layer has no backdrops, none of this kicks in and the layer renders in a single render
+pass via the existing fast path.
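The three-way split can be sketched as a small partition function. This is an illustrative model, not the scheduler itself: `partition_layer` is a hypothetical name, sub-batches are reduced to their kind names, and indices are layer-relative here for simplicity (the text describes `bracket_start_index` as an absolute index).

```python
def partition_layer(kinds: list) -> tuple:
    """Split one layer's sub-batches into (pass_a, bracket, pass_b) index lists."""
    backdrops = [i for i, k in enumerate(kinds) if k == "Backdrop"]
    if not backdrops:
        # Fast path: no bracket, the whole layer is a single render pass.
        return list(range(len(kinds))), [], []
    bracket_start_index = backdrops[0]
    pass_a = [i for i, k in enumerate(kinds)
              if k != "Backdrop" and i < bracket_start_index]
    pass_b = [i for i, k in enumerate(kinds)
              if k != "Backdrop" and i >= bracket_start_index]
    return pass_a, backdrops, pass_b


kinds = ["Tessellated", "SDF", "Backdrop", "SDF", "Backdrop", "Text"]
print(partition_layer(kinds))           # ([0, 1], [2, 4], [3, 5])
print(partition_layer(["SDF", "Text"]))  # ([0, 1], [], [])
```

Note that index 3 lands in Pass B even though it was submitted before the backdrop at index 4: every backdrop in the layer is pulled into the one bracket.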
+
+The downsample runs once per layer, not once per sigma: it just copies `source_texture` to a ¼-
+resolution working texture and doesn't depend on the kernel. Each unique sigma in the layer triggers
+one H-blur (reads `downsample_texture`, writes `h_blur_texture`) and one V-composite (reads
+`h_blur_texture`, writes `source_texture` per-primitive with the SDF mask). Sub-batch coalescing in
+`append_or_extend_sub_batch` merges contiguous same-sigma backdrops into a single instanced V-
+composite draw call; non-contiguous same-sigma backdrops still share the H-blur output but issue
+separate V-composite draws.
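As a rough way to see how these rules add up, the sketch below counts the work one bracket does for a given submission sequence. It is a simplified model under stated assumptions: `bracket_work` is a hypothetical helper, each primitive is represented only by its blur sigma (`None` for non-backdrop content), and contiguity is judged on adjacent submissions, which is how the coalescing is described above.

```python
from itertools import groupby


def bracket_work(submissions: list) -> dict:
    """Count passes and draws for one layer's bracket.

    submissions: per-primitive blur sigma in submission order,
                 None for non-backdrop primitives (hypothetical model).
    """
    # groupby collapses adjacent equal values, mimicking sub-batch coalescing.
    runs = [sigma for sigma, _ in groupby(submissions) if sigma is not None]
    unique_sigmas = set(runs)
    return {
        "downsample_passes": 1 if runs else 0,     # once per layer
        "h_blur_passes": len(unique_sigmas),       # once per unique sigma
        "v_composite_passes": len(unique_sigmas),  # once per unique sigma
        "v_composite_draws": len(runs),            # once per contiguous run
    }


# Two adjacent sigma-12 backdrops coalesce; the later sigma-12 one does not.
print(bracket_work([None, 12, 12, None, 12, 8]))
# {'downsample_passes': 1, 'h_blur_passes': 2, 'v_composite_passes': 2, 'v_composite_draws': 3}
```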
+
+#### Submission-order trade-off
+
+Within Pass A and Pass B, sub-batches render in the user's submission order. What the bracket model
+sacrifices is *interleaved* ordering between backdrop and non-backdrop content within a single
+layer. A non-backdrop sub-batch submitted between two backdrops still renders in Pass B (after the
+bracket), not at its submission position. Worked example:
+
+```
+draw.rectangle(layer, bg, GRAY)             // 0  Tessellated → Pass A
+draw.rectangle(layer, card_blue, BLUE)      // 1  SDF         → Pass A
+draw.rectangle_backdrop(layer, panelA, 12)  // 2  Backdrop    → Bracket (sees: bg + blue card)
+draw.rectangle(layer, card_red, RED)        // 3  SDF         → Pass B (drawn ON TOP of panelA)
+draw.rectangle_backdrop(layer, panelB, 12)  // 4  Backdrop    → Bracket (sees: bg + blue card; same as panelA)
+draw.text(layer, "label", ...)              // 5  Text        → Pass B (drawn ON TOP of both panels)
+```
+
+In this layer, panelB does *not* see card_red — even though card_red was submitted before panelB —
+because both backdrops sample `source_texture` as it stood at the bracket entry, which is after
+Pass A and before card_red has rendered. card_red also ends up on top of panelB, not underneath it.
+
+The user controls the alternative outcome by splitting layers. Putting card_red and panelB into a
+new layer (via `draw.new_layer`) gives panelB a fresh source snapshot that includes panelA and
+card_red:
+
+```
+base := draw.begin(...)
+draw.rectangle(base, bg, GRAY)
+draw.rectangle(base, card_blue, BLUE)
+draw.rectangle_backdrop(base, panelA, 12)  // panelA in base layer's bracket
+
+top := draw.new_layer(base, ...)
+draw.rectangle(top, card_red, RED)
+draw.rectangle_backdrop(top, panelB, 12)   // top layer's bracket; sees base + card_red
+draw.text(top, "label", ...)
+```
+
+Why one bracket per layer and not one per backdrop? Each bracket adds three render passes
+(downsample + H-blur + V-composite) and at least three tile-cache flushes on tilers like Mali
+Valhall. Strict submission-order semantics would require one bracket per cluster of contiguous
+backdrops, which scales the GPU cost linearly with how interleaved the user's submission happens
+to be — a footgun. The current design caps the bracket cost per layer regardless of submission
+interleave, and gives the user explicit control over ordering through the existing layer
+abstraction. This matches the cost/complexity envelope of iOS `UIVisualEffectView` and CSS
+`backdrop-filter` (both of which constrain backdrop ordering implicitly).
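The cost argument can be made concrete by counting bracket render passes under both policies. The sketch below is a back-of-the-envelope model: the helper names are hypothetical, every backdrop is assumed to use the same sigma (so each bracket is exactly three passes), and the Pass A / Pass B render passes around the brackets are not counted.

```python
from itertools import groupby

# downsample + H-blur + V-composite, assuming a single sigma per bracket
PASSES_PER_BRACKET = 3


def backdrop_clusters(kinds: list) -> int:
    """Number of maximal runs of contiguous Backdrop sub-batches."""
    return sum(1 for is_backdrop, _ in groupby(kinds, key=lambda k: k == "Backdrop")
               if is_backdrop)


def bracket_passes_per_layer(kinds: list) -> int:
    """Current design: at most one bracket per layer."""
    return PASSES_PER_BRACKET if "Backdrop" in kinds else 0


def bracket_passes_strict_order(kinds: list) -> int:
    """Hypothetical strict submission order: one bracket per cluster."""
    return PASSES_PER_BRACKET * backdrop_clusters(kinds)


interleaved = ["SDF", "Backdrop", "SDF", "Backdrop", "Text", "Backdrop"]
print(bracket_passes_per_layer(interleaved))     # 3
print(bracket_passes_strict_order(interleaved))  # 9
```

Under the per-layer policy the count stays at three however the submissions are interleaved; under strict ordering it grows with the number of clusters.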
+
 ### Vertex layout
 
 The vertex struct is unchanged from the current 20-byte layout: