Update draw README

Zachary Levy
2026-04-28 16:04:44 -07:00
parent c59858dcd4
commit ff29dbd92f
+235 -176
@@ -9,51 +9,53 @@ The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline)
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
(`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
`tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
`Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
`Primitive.flags`) to evaluate one of four signed distance functions:
- **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
- **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
- **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
- **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture are mutually
exclusive.
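The packing claim can be illustrated with a hypothetical C layout. Only the 80-byte total, the 16-byte union, and the flags encoding are stated by this README; the remaining field names and sizes below are placeholders (the real definition lives in `pipeline_2d_base.odin`):

```c
#include <stdint.h>

// Hypothetical layout illustrating the Uv_Or_Effects packing. Non-union
// fields are illustrative placeholders, not the real struct definition.
typedef union {
    float uv_rect[4];    // textured fill: UV rect
    float effects[4];    // gradient / outline parameters
} Uv_Or_Effects;         // 16 bytes either way: zero size increase

typedef struct {
    float         bounds[4];   // placeholder: position + half-size
    float         color[4];    // placeholder: fill color
    Uv_Or_Effects uv;          // UV rect OR gradient/outline params
    float         params[6];   // placeholder: per-kind parameters
    uint32_t      flags;       // low byte: Shape_Kind; upper bits: Shape_Flags
    uint32_t      reserved;
} Primitive;                   // 80 bytes total

_Static_assert(sizeof(Uv_Or_Effects) == 16, "union adds no size");
_Static_assert(sizeof(Primitive) == 80, "GPU storage buffer stride");
```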
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
instead of ~250 vertices (~5000 bytes).
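The analytic anti-aliasing step amounts to a `smoothstep` over roughly one pixel of signed distance. The half-pixel window below is an assumption; `px` stands in for what a fragment shader would obtain from `fwidth(d)`:

```c
// Analytic coverage from a signed distance (d < 0 inside). No MSAA needed:
// the edge is faded over ~1 pixel of distance. Window width is an assumption.
static float smoothstep_f(float e0, float e1, float x) {
    float t = (x - e0) / (e1 - e0);
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);
    return t * t * (3.0f - 2.0f * t);
}

static float coverage(float d, float px) {
    return smoothstep_f(0.5f * px, -0.5f * px, d);
}
```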
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.
All public drawing procs use prefixed names for clarity: `sdf_*` for SDF-path shapes, `tess.*` for
tessellated-path procs. Proc groups provide a single entry point per shape concept (e.g.,
`sdf_rectangle` dispatches to `sdf_rectangle_solid` or `sdf_rectangle_gradient` based on argument
count).
## 2D rendering pipeline plan
This section documents the planned architecture for levlib's 2D rendering system. The design is driven
@@ -66,22 +68,23 @@ primitives and effects can be added to the library without architectural changes
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
**≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
occupancy at the first cliff is accepted by design). Each effects primitive includes the base
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
low-end hardware (see register analysis below).
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
Valhall). Separated from the other pipelines because it structurally requires ending the current
render pass and copying the render target before any backdrop-sampling fragment can execute — a
command-buffer-level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
@@ -97,56 +100,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, throughput drops sharply.
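This step behavior can be written down as a tiny model. The cliff values follow the Valhall figures used throughout this section; real allocation granularity is hardware- and driver-specific:

```c
// Illustrative occupancy model for an architecture with cliffs at 32 and 64
// registers (the Valhall structure used in this section). A sketch, not a
// description of any specific core's allocator.
static int occupancy_pct(int regs) {
    if (regs <= 32) return 100;  // below the first cliff: registers are free
    if (regs <= 64) return 50;   // one register over 32: thread count halves
    return 0;                    // beyond the register file: spilling
}
```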
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:
- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
**Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:
| Pipeline | Cliff targeted | Margin | Register budget | Rationale |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline | 32 (Valhall 1st cliff) | 8 | **≤24 regs** | Handles 90%+ of frame fragments; must run at full occupancy |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8 | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy |
| Effects pipeline | 64 (Valhall 2nd cliff) | 8 | **≤56 regs** | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:
1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` silently promoted to `highp` (by a bug fix, a compiler
heuristic change, or a contributor unaware of the budget) costs 2 registers per affected `vec4`.
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because the
compiler allocates registers for the worst-case path.
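The arithmetic behind the ~4 ms figure can be checked directly. The ~2 ms full-occupancy baseline is the text's figure, not derived here:

```c
// Back-of-envelope check of the unified-shader cost claim: 4K resolution with
// 1.5x overdraw, and a 2x cost multiplier when occupancy halves.
static double frame_fragments_4k(double overdraw) {
    return 3840.0 * 2160.0 * overdraw;   // ~12.4M fragments at 1.5x
}

// Crossing the occupancy cliff roughly halves throughput, doubling cost.
static double unified_shader_ms(double baseline_ms) {
    return baseline_ms * 2.0;
}
```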
All main-pipeline members (SDF shapes, tessellated geometry, text, textured rectangles) cluster at
12–24 registers — below the cliff on every architecture — so unifying them costs nothing in
occupancy.
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on shapes that carry effects. This is acceptable for the small fraction of frame
fragments that effects cover.
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture-copy out of the main render pass.
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
#### Known limitations: V3D and Bifrost (16-register cliff)
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
reduced occupancy regardless of which shape kind or effect is active.
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).
#### Verifying register counts
The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
measured `malioc` numbers is a follow-up task.
#### Backdrop split: render-pass structure
@@ -156,10 +216,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
@@ -271,18 +332,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
#### Which category our branches fall into
Our design has three branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
cost.**
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function (~15–30
instructions per kind). Divergence occurs only at primitive boundaries where adjacent quads have
different kinds.
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
solid color selection and outline rendering — all lightweight branches (3–8 instructions per
path). Divergence at primitive boundaries between different flag combinations has negligible cost.
For category 3, the divergence analysis depends on primitive size:
@@ -299,11 +365,12 @@ For category 3, the divergence analysis depends on primitive size:
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the cost
of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
@@ -327,10 +394,10 @@ our design:
> have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
divergent warps pay for both — a large cost. Our SDF kind branches are short (~15–30 instructions
each), and the gradient/texture/solid color selection branches are shorter still (3–8 instructions
each). Even fully divergent, the combined penalty is ~30–60 extra instructions — comparable to a
single texture sample's latency.
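The divergence-cost arithmetic from earlier in this section can be checked numerically. The 32-wide warp, the divergent-warp count, and the instruction rate are the figures the text assumes:

```c
// Numeric check of the boundary-divergence estimate: ~12.4M fragments in
// 32-wide warps, a few thousand divergent warps paying ~60 extra
// instructions each, at ~12G instructions/sec.
static double total_warps(double fragments) {
    return fragments / 32.0;   // ~387,500 warps for a 4K frame
}

static double divergence_cost_us(double divergent_warps, double extra_instr,
                                 double instr_per_sec) {
    return divergent_warps * extra_instr / instr_per_sec * 1e6;
}
```

3,000 warps × 60 instructions at 12G instr/s comes to ~15 µs, under 0.2% of an 8.3 ms frame budget.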
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
@@ -338,9 +405,10 @@ our design:
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
split heavy effects into separate pipelines. Within the main pipeline, the four
SDF kind branches and flag-based color selection cluster at ~22–26 registers (see register
analysis in Current state), within the ≤24-register budget that guarantees full occupancy on
Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
**References:**
@@ -361,19 +429,20 @@ our design:
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
(SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
structs, drawn instanced. Used for all shapes with closed-form signed distance functions.
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
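The encoding can be sketched as follows. The enum values follow the kind numbering in the text; the exact position of the `Shape_Flags` bits above the kind byte is an assumption:

```c
// Shape_Kind lives in the low byte of Primitive.flags; Shape_Flags occupy
// the upper bits (exact bit layout above the kind byte is assumed here).
enum Shape_Kind {
    KIND_SOLID    = 0,  // tessellated path
    KIND_RRECT    = 1,
    KIND_NGON     = 2,
    KIND_ELLIPSE  = 3,
    KIND_RING_ARC = 4,
};

static enum Shape_Kind shape_kind(unsigned flags) {
    return (enum Shape_Kind)(flags & 0xFFu);  // low byte selects the SDF
}

static unsigned shape_flag_bits(unsigned flags) {
    return flags >> 8;  // remaining bits: gradient/texture/outline flags
}
```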
#### Why SDF for shapes
@@ -425,47 +494,39 @@ The tessellated path retains the existing direct vertex buffer layout (20 bytes/
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
#### Shape kinds and SDF dispatch
The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
`pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
and set the kind automatically:
| User-facing proc | Shape_Kind | SDF function | Notes |
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
| `rectangle_texture` | `RRect` | `sdRoundedBox` | Textured fill; `.Textured` flag set |
| `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
| `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
| `ellipse` | `Ellipse` | `sdEllipseApprox` | Approximate ellipse SDF (fast, suitable for UI) |
| `polygon` | `NGon` | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle |
| `ring` (full) | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping |
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask |
| `ring` (pie slice) | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |
The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
arc geometry). See the `Shape_Flag` enum in `pipeline_2d_base.odin` for the authoritative flag
definitions and bit assignments.
**What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `tess.pixel` (single-pixel points)
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape`
The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
### Effects pipeline
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
require, keeping each sub-pass well under the 32-register cliff.
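The separability this multi-pass structure relies on (a 2D Gaussian equals a horizontal 1D pass followed by a vertical one) can be checked numerically. A minimal C sketch using a tiny `[1 2 1]/4` kernel and zero-padded borders, both illustrative choices rather than the renderer's actual kernel:

```c
#include <assert.h>
#include <math.h>

#define W 4
#define H 4

/* 1D [1 2 1]/4 kernel along x, zero-padded at the borders. */
static void blur_h(float src[H][W], float dst[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float l = (x > 0)     ? src[y][x - 1] : 0.0f;
            float r = (x < W - 1) ? src[y][x + 1] : 0.0f;
            dst[y][x] = 0.25f * l + 0.5f * src[y][x] + 0.25f * r;
        }
}

/* Same kernel along y. */
static void blur_v(float src[H][W], float dst[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float u = (y > 0)     ? src[y - 1][x] : 0.0f;
            float d = (y < H - 1) ? src[y + 1][x] : 0.0f;
            dst[y][x] = 0.25f * u + 0.5f * src[y][x] + 0.25f * d;
        }
}

/* Direct 3x3 convolution with the outer product of [1 2 1]/4. */
static void blur_2d(float src[H][W], float dst[H][W]) {
    const float k[3] = {0.25f, 0.5f, 0.25f};
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            float acc = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int sx = x + dx, sy = y + dy;
                    if (sx < 0 || sx >= W || sy < 0 || sy >= H) continue;
                    acc += k[dy + 1] * k[dx + 1] * src[sy][sx];
                }
            dst[y][x] = acc;
        }
}
```

For a k-tap kernel the two-pass form costs 2k samples per pixel instead of k², which is where the per-pass register and ALU savings come from.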
**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once
regardless of how many backdrop effects are visible. When no backdrop effects are present, the bracket
is never entered and the texture copy never happens — zero cost.
**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
bracket would have negligible impact on frame time.
### Vertex layout
```odin
Primitive :: struct {
	// ... (fields elided in this excerpt; see pipeline_2d_base.odin)
}
// Total: 80 bytes (std430 aligned)
```
`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
`Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min, v_min, u_max,
v_max) with a `Gradient_Outline` struct containing `gradient_color: Color`, `outline_color: Color`,
`gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32` (packed f16 outline
width). The `flags` field encodes the `Shape_Kind` in the low byte and `Shape_Flags` in bits 8+
via `pack_kind_flags`.
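A minimal C sketch of the packing just described: `Shape_Kind` in the low byte, `Shape_Flags` from bit 8 up. The kind values come from this README; the flag value is an assumption, and the real `pack_kind_flags` lives in Odin, so this is illustrative only:

```c
#include <assert.h>
#include <stdint.h>

enum { SHAPE_KIND_RRECT = 1, SHAPE_KIND_NGON = 2 }; /* kinds named above */
enum { SHAPE_FLAG_STROKE = 1u << 0 };               /* assumed bit value */

/* Kind in bits 0..7, flags shifted into bits 8+. */
static uint32_t pack_kind_flags(uint32_t kind, uint32_t flags) {
    return (kind & 0xFFu) | (flags << 8);
}

static uint32_t unpack_kind(uint32_t packed)  { return packed & 0xFFu; }
static uint32_t unpack_flags(uint32_t packed) { return packed >> 8; }
```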
### Draw submission order
Multi-channel signed distance field (MSDF) text atlases would allow resolution-independent glyph rendering from a single small atlas per font. This would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency.
This is explicitly deferred. The SDF shape migration is independent of and does not block text rendering.
#### Textured draw procs
Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
`rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.
An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
quads, on the theory that the tessellated path's lower register count would improve occupancy at
large fragment counts. Both paths are well within the ≤24-register main pipeline budget — both run at
full occupancy on every target architecture (Valhall and above). The remaining ALU difference (~15
extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise. Meanwhile,
splitting into a separate pipeline would add ~15μs per pipeline bind on the CPU side per scissor,
matching or exceeding the GPU-side savings. Within the main pipeline, unified remains strictly better.
SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
`circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
parameters on each proc rather than separate overloads. Future per-shape texture variants
(`circle_texture`, `ellipse_texture`) are additive.
#### What SDF anti-aliasing does and does not do for textured draws
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: `fit_params` computes UVs from the fit mode, then
`rectangle_texture` is called with the appropriate radii (zero for sharp corners, per-corner values
from Clay's `cornerRadius` otherwise).
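`fit_params` itself is not shown in this excerpt, so purely as a hypothetical sketch: a cover-style fit (fill the slot, crop the overflowing axis symmetrically) implies UV math along these lines. All names here are illustrative:

```c
#include <assert.h>
#include <math.h>

typedef struct { float u0, v0, u1, v1; } Uv_Rect;

/* Hypothetical cover-fit: scale the texture so it fully covers the
   destination slot, cropping the overflowing axis around the center. */
static Uv_Rect cover_fit_uvs(float tex_w, float tex_h,
                             float dst_w, float dst_h) {
    float tex_aspect = tex_w / tex_h;
    float dst_aspect = dst_w / dst_h;
    Uv_Rect uv = {0.0f, 0.0f, 1.0f, 1.0f};
    if (tex_aspect > dst_aspect) {
        /* Texture is wider than the slot: crop left/right. */
        float visible = dst_aspect / tex_aspect;
        uv.u0 = 0.5f - 0.5f * visible;
        uv.u1 = 0.5f + 0.5f * visible;
    } else {
        /* Texture is taller: crop top/bottom. */
        float visible = tex_aspect / dst_aspect;
        uv.v0 = 0.5f - 0.5f * visible;
        uv.v1 = 0.5f + 0.5f * visible;
    }
    return uv;
}
```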
#### Deferred features
The following are plumbed in the descriptor but not implemented in phase 1:
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
- **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential future additions, following the existing naming convention.
**References:**