Update draw README

Zachary Levy
2026-04-28 16:04:44 -07:00
parent c59858dcd4
commit ff29dbd92f
+235 -176
@@ -9,51 +9,53 @@ The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline)
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
  (`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
  `tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
  shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
  reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
  primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
  `Primitive.flags`) to evaluate one of four signed distance functions:
  - **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
    circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
    radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
  - **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
  - **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
  - **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
    normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture are mutually
exclusive.
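The zero-size overlay can be illustrated with a C union (the field names here are hypothetical; the actual `Uv_Or_Effects` layout lives in `pipeline_2d_base.odin`):

```c
#include <stdint.h>

/* Illustration of the 16-byte overlay. Field names are hypothetical,
   not the actual levlib layout. */
typedef struct { float u0, v0, u1, v1; } Uv_Rect;   /* texture path */
typedef struct {
    uint32_t color_a;     /* packed RGBA8 gradient endpoint */
    uint32_t color_b;     /* packed RGBA8 gradient endpoint */
    float    outline_px;  /* outline thickness              */
    float    gradient_t;  /* gradient direction parameter   */
} Effects_Params;                                   /* gradient/outline path */

/* Raw union: both interpretations occupy the same 16 bytes, so the
   gradient/outline parameters add zero bytes to the 80-byte Primitive. */
typedef union { Uv_Rect uv; Effects_Params effects; } Uv_Or_Effects;
```

Because gradient/outline and texture are mutually exclusive, only one interpretation of the 16 bytes is ever live for a given primitive.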
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
instead of ~250 vertices (~5000 bytes).
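The analytical anti-aliasing step is a small function of the signed distance. A minimal C sketch (assuming the edge width `aa` comes from something like `fwidth(d)` in the shader; that detail is an assumption, not the exact levlib code):

```c
/* GLSL-style smoothstep: cubic ramp from 0 at e0 to 1 at e1. */
static float smoothstep_f(float e0, float e1, float x) {
    float t = (x - e0) / (e1 - e0);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    return t * t * (3.0f - 2.0f * t);
}

/* Analytic coverage from a signed distance d (negative = inside):
   fully opaque one aa-width inside the edge, fully transparent one
   aa-width outside, smooth in between. */
static float sdf_coverage(float d, float aa) {
    return 1.0f - smoothstep_f(-aa, aa, d);
}
```

This is why MSAA adds nothing for SDF shapes: the fragment's coverage is already computed exactly from the distance field.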
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.
## 2D rendering pipeline plan
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
@@ -66,22 +68,23 @@ primitives and effects can be added to the library without architectural changes
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
   **≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
   in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
   occupancy at the first cliff is accepted by design). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
   low-end hardware (see register analysis below).
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
   target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
   where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
   Valhall). Separated from the other pipelines because it structurally requires ending the current
   render pass and copying the render target before any backdrop-sampling fragment can execute — a
   command-buffer-level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
@@ -97,56 +100,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, throughput drops sharply.
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:

- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
  **Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
  registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
  same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
  Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
  below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
  the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
  Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:

| Pipeline            | Cliff targeted         | Margin | Register budget   | Rationale                                                                                     |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline       | 32 (Valhall 1st cliff) | 8      | **≤24 regs**      | Handles 90%+ of frame fragments; must run at full occupancy                                   |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8      | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy                    |
| Effects pipeline    | 64 (Valhall 2nd cliff) | 8      | **≤56 regs**      | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:

1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
   allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
   mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` promoted to `highp` (by bug fix, compiler heuristic
   change, or a contributor not knowing) costs 2 registers per affected `vec4`.
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw
(~12.4M fragments), a single unified shader containing the shadow branch would cost ~4ms instead of
~2ms on Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because
the compiler allocates registers for the worst-case path.
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture-copy out of the main render pass.
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
#### Known limitations: V3D and Bifrost (16-register cliff)
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
reduced occupancy regardless of which shape kind or effect is active.
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).
#### Verifying register counts
The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
measured `malioc` numbers is a follow-up task.
#### Backdrop split: render-pass structure
@@ -156,10 +216,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
@@ -271,18 +332,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
#### Which category our branches fall into
Our design has three branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
   The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
   to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the
   same kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function
   (~15–30 instructions per kind). Divergence occurs only at primitive boundaries where adjacent
   quads have different kinds.
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
   The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
   solid color selection and outline rendering — all lightweight branches (3–8 instructions per
   path). Divergence at primitive boundaries between different flag combinations has negligible
   cost.
For category 3, the divergence analysis depends on primitive size:
@@ -299,11 +365,12 @@ For category 3, the divergence analysis depends on primitive size:
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the
cost of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
@@ -327,10 +394,10 @@ our design:
> have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay a large cost twice. Our SDF kind branches are short (~15–30 instructions
   each), and the gradient/texture/solid color selection branches are shorter still (3–8
   instructions each). Even fully divergent, the combined penalty is ~30–60 extra instructions —
   comparable to a single texture sample's latency.
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
@@ -338,9 +405,10 @@ our design:
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects into separate pipelines. Within the main pipeline, the four SDF kind branches
   and flag-based color selection cluster at ~22–26 hand-counted registers (see register analysis in
   Current state), expected to compile to within the ≤24-register budget that guarantees full
   occupancy on Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
**References:**
@@ -361,19 +429,20 @@ our design:
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
  (SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
  user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
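The flags decoding described above can be sketched in C (the kind numbering follows this README; the `KIND_*` constant names and the exact bit split for `Shape_Flags` are illustrative assumptions, not the actual Odin declarations):

```c
#include <stdint.h>

/* Shape_Kind occupies the low byte of Primitive.flags; the remaining
   bits carry Shape_Flags. Numbering follows this README; the names
   and the >>8 flag split are illustrative. */
typedef enum {
    KIND_SOLID    = 0,  /* tessellated path   */
    KIND_RRECT    = 1,  /* sdRoundedBox       */
    KIND_NGON     = 2,  /* sdRegularPolygon   */
    KIND_ELLIPSE  = 3,  /* sdEllipseApprox    */
    KIND_RING_ARC = 4,  /* annular ring / arc */
} Shape_Kind;

static Shape_Kind kind_from_flags(uint32_t flags)  { return (Shape_Kind)(flags & 0xFFu); }
static uint32_t   shape_flags_bits(uint32_t flags) { return flags >> 8; }
```

Because the kind reaches the fragment shader as a `flat` varying, every fragment of one quad decodes the same value, which is what keeps the dispatch warp-coherent away from primitive boundaries.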
#### Why SDF for shapes
@@ -425,47 +494,39 @@ The tessellated path retains the existing direct vertex buffer layout (20 bytes/
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
#### Shape folding #### Shape kinds and SDF dispatch
The SDF path evaluates a single function — `sdRoundedBox` — for all primitives. There is no The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
`Shape_Kind` enum or per-primitive kind dispatch in the fragment shader. Shapes that are algebraically four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
special cases of a rounded rectangle are emitted as RRect primitives by the CPU-side drawing procs: `pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
| User-facing shape | RRect mapping | Notes | | User-facing proc | Shape_Kind | SDF function | Notes |
| ---------------------------- | -------------------------------------------- | ---------------------------------------- | | -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| Rectangle (sharp or rounded) | Direct | Per-corner radii from `radii` param | | `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
| Circle | `half_size = (r, r)`, `radii = (r, r, r, r)` | Uniform radii = half-size | | `rectangle_texture` | `RRect` | `sdRoundedBox` | Textured fill; `.Textured` flag set |
| Line segment / capsule | Rotated RRect, `radii = half_thickness` | Stadium shape (fully-rounded minor axis) | | `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
| Full ring / annulus | Stroked circle at mid-radius | `stroke_px = outer - inner` | | `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
Shapes without a closed-form RRect reduction are drawn via the tessellated path: The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
| Shape | Tessellated proc | Method | definitions and bit assignments.
| ------------------------- | ---------------------------------- | -------------------------- |
| Ellipse | `tes_ellipse`, `tes_ellipse_lines` | Triangle fan approximation |
| Regular polygon (N-gon) | `tes_polygon`, `tes_polygon_lines` | Triangle fan from center |
| Circle sector (pie slice) | `tes_sector` | Triangle fan arc |
The `Shape_Flags` bit set controls rendering mode per primitive:
| Flag | Bit | Effect |
| ----------------- | --- | -------------------------------------------------------------------- |
| `Stroke` | 0 | Outline instead of fill (`d = abs(d) - stroke_width/2`) |
| `Textured` | 1 | Sample texture using `uv.uv_rect` (mutually exclusive with Gradient) |
| `Gradient` | 2 | Bilinear 4-corner interpolation from `uv.corner_colors` |
| `Gradient_Radial` | 3 | Radial 2-color falloff (inner/outer) from `uv.corner_colors[0..1]` |
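The `pack_flags` layout described above (mode marker in the low byte, shape flags from bit 8) can be sketched as follows — any detail beyond that split is an assumption:

```c
#include <stdint.h>

// Shape flag bit indices from the table above.
enum { FLAG_STROKE, FLAG_TEXTURED, FLAG_GRADIENT, FLAG_GRADIENT_RADIAL };

// Low byte: tessellated/SDF mode marker. Bits 8+: shape flags.
static uint32_t pack_flags(uint32_t mode, uint32_t shape_flags) {
    return (mode & 0xFF) | (shape_flags << 8);
}

static uint32_t unpack_mode(uint32_t packed)  { return packed & 0xFF; }
static uint32_t unpack_flags(uint32_t packed) { return packed >> 8; }
```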
**What stays tessellated:** **What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation) - Text (SDL_ttf atlas, pending future MSDF evaluation)
- Ellipses (`tes_ellipse`, `tes_ellipse_lines`) - `tess.pixel` (single-pixel points)
- Regular polygons (`tes_polygon`, `tes_polygon_lines`) - `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- Circle sectors / pie slices (`tes_sector`) - `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- `tes_triangle`, `tes_triangle_fan`, `tes_triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape` - Any raw vertex geometry submitted via `prepare_shape`
The design rule: if the shape reduces to `sdRoundedBox`, it goes SDF. If it requires a different SDF The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
function or is described by a vertex list, it stays tessellated. `Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
### Effects pipeline ### Effects pipeline
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences **Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by (downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
pass has a low-to-medium register footprint (~15–40 registers), well within the main pipeline's
occupancy range. The multi-pass approach avoids the monolithic 70+ register shader that a single-pass multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
Gaussian blur would require, making backdrop effects viable on low-end mobile GPUs (including require, keeping each sub-pass well under the 32-register cliff.
Mali-G31 and VideoCore VI) where per-thread register limits are tight.
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command **Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame,
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
is never entered and the texture copy never happens — zero cost. is never entered and the texture copy never happens — zero cost.
**Why not split the backdrop sub-passes into separate pipelines?** The individual passes range from **Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
~15 to ~40 registers, which does cross Mali's 32-register cliff. However, the register-pressure argument registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
that justifies the main/effects split does not apply here. The main/effects split protects the The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
_common path_ (90%+ of frame fragments) from the uncommon path's register cost. Inside the backdrop sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
pipeline there is no common-vs-uncommon distinction — if backdrop effects are active, every sub-pass all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at all. typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at typical bracket would have negligible impact on frame time.
UI scales), so the occupancy variation within the bracket has negligible impact on frame time.
### Vertex layout ### Vertex layout
```
Primitive :: struct {
	// ... (field list elided in this excerpt) ...
}
// Total: 80 bytes (std430 aligned)
```
`RRect_Params` holds the rounded-rectangle parameters directly — there is no `Shape_Params` union. `Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Uv_Or_Gradient` is a `#raw_union` that aliases `[4]f32` (texture UV rect) with `[4]Color` (gradient `Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
corner colors, clockwise from top-left: TL, TR, BR, BL). The `flags` field encodes both the writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
tessellated/SDF mode marker (low byte) and shape flags (bits 8+) via `pack_flags`. `Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min, v_min, u_max,
### Draw submission order ### Draw submission order
MSDF atlases would allow resolution-independent glyph rendering from a single small atlas per font. The migration would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool. - Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution). - Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph mode in the fragment shader, which would require reintroducing a mode/kind - A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
distinction (the current shader evaluates only `sdRoundedBox` with no kind dispatch). already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency. - Potential removal of the SDL_ttf dependency.
This is explicitly deferred. The SDF shape migration is independent of and does not block text This is explicitly deferred. The SDF shape migration is independent of and does not block text
#### Textured draw procs #### Textured draw procs
Textured rectangles route through the existing SDF path via `sdf_rectangle_texture` and Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
`sdf_rectangle_texture_corners`, mirroring `sdf_rectangle` and `sdf_rectangle_corners` exactly — `rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
same parameters, same naming — with the color parameter replaced by a texture ID plus an optional parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.
tint.
An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
quads, on the theory that the tessellated path's lower register count (~16 regs vs ~18 for the SDF quads, on the theory that the tessellated path's lower register count would improve occupancy at
textured branch) would improve occupancy at large fragment counts. Applying the register-pressure large fragment counts. Both paths are well within the ≤24-register main pipeline budget — both run at
analysis from the pipeline-strategy section above shows this is wrong: both 16 and 18 registers are full occupancy on every target architecture (Valhall and above). The remaining ALU difference (~15
well below the register cliff (~43 regs on consumer Ampere/Ada, ~32 on Volta/A100), so both run at extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise. Meanwhile,
100% occupancy. The remaining ALU difference (~15 extra instructions for the SDF evaluation) amounts splitting into a separate pipeline would add ~15μs per pipeline bind on the CPU side per scissor,
to ~20μs at 4K — below noise. Meanwhile, splitting into a separate pipeline would add ~15μs per matching or exceeding the GPU-side savings. Within the main pipeline, unified remains strictly better.
pipeline bind on the CPU side per scissor, matching or exceeding the GPU-side savings. Within the
main pipeline, unified remains strictly better.
The naming convention uses `sdf_` and `tes_` prefixes to indicate the rendering path, with suffixes SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
for modifiers: `sdf_rectangle_texture` and `sdf_rectangle_texture_corners` sit alongside `circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
`sdf_rectangle` (solid or gradient overload). Proc groups like `sdf_rectangle` dispatch to parameters on each proc rather than separate overloads. Future per-shape texture variants
`sdf_rectangle_solid` or `sdf_rectangle_gradient` based on argument count. Future per-shape texture (`circle_texture`, `ellipse_texture`) are additive.
variants (`sdf_circle_texture`) are additive.
#### What SDF anti-aliasing does and does not do for textured draws #### What SDF anti-aliasing does and does not do for textured draws
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges, The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
stroke outlines. It does not anti-alias or sharpen the texture content. Inside the shape, fragments outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the `Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: zero `cornerRadius` dispatches to `sdf_rectangle_texture` (SDF, sharp existing rectangle handling: `fit_params` computes UVs from the fit mode, then
corners), nonzero dispatches to `sdf_rectangle_texture_corners` (SDF, per-corner radii). A `rectangle_texture` is called with the appropriate radii (zero for sharp corners, per-corner values
`fit_params` call computes UVs from the fit mode before dispatch. from Clay's `cornerRadius` otherwise).
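A sketch of what a cover-style fit computation might look like — `fit_cover` is a hypothetical helper; the document does not specify `fit_params`' actual fit modes beyond "computes UVs from the fit mode":

```c
typedef struct { float u0, v0, u1, v1; } Uv_Rect;

// "Cover" fit: fill the destination completely, cropping the texture on the
// axis that would overflow, keeping the sampled window centered.
static Uv_Rect fit_cover(float tex_w, float tex_h, float dst_w, float dst_h) {
    Uv_Rect uv = { 0.0f, 0.0f, 1.0f, 1.0f };
    float tex_aspect = tex_w / tex_h;
    float dst_aspect = dst_w / dst_h;
    if (tex_aspect > dst_aspect) {             // texture wider: crop sides
        float half = 0.5f * dst_aspect / tex_aspect;
        uv.u0 = 0.5f - half; uv.u1 = 0.5f + half;
    } else if (tex_aspect < dst_aspect) {      // texture taller: crop top/bottom
        float half = 0.5f * tex_aspect / dst_aspect;
        uv.v0 = 0.5f - half; uv.v1 = 0.5f + half;
    }
    return uv;
}
```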
#### Deferred features #### Deferred features
The following are plumbed in the descriptor but not implemented in phase 1:
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist. - **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values. - **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers. - **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
- **Per-shape texture variants**: `sdf_circle_texture`, `tes_ellipse_texture`, `tes_polygon_texture` — potential future additions, reserved by naming convention. - **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential future additions, following the existing naming convention.
**References:** **References:**