Backdrop Path + Cybersteel (#23)

Co-authored-by: Zachary Levy <zachary@sunforge.is>
Reviewed-on: #23
This commit was merged in pull request #23.
2026-05-01 05:43:10 +00:00
parent e36229a3ef
commit 5317b8f142
66 changed files with 5806 additions and 2427 deletions
@@ -5,54 +5,60 @@ Clay UI integration.
## Current state
The renderer uses a single unified `Core_2D` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
(`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
`tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
`Core_2D_Primitive` structs (96 bytes each) uploaded each frame to a GPU storage buffer. The vertex
shader reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
`Core_2D_Primitive.flags`) to evaluate one of four signed distance functions:
- **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
- **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
- **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
- **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
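The 96-byte figure can be sanity-checked with a C transcription of the struct. Only the total size, the dedicated 16-byte `uv_rect` and `effects` slots, and the kind-in-low-byte `flags` encoding come from this document; the exact field order and the `bounds`/`color` packing shown here are assumptions:

```c
#include <stdint.h>

/* Hypothetical layout of Core_2D_Primitive. The total size (96 bytes), the
   16-byte uv_rect and effects slots, and the flags encoding match the text;
   the remaining field split is illustrative only. */
typedef struct {
    float    bounds[4];        /* center xy + half-size xy (assumed split) */
    float    corner_radii[4];  /* per-corner radii for the RRect kind */
    float    color[4];         /* fill color (assumed slot) */
    float    uv_rect[4];       /* texture UV rect: its own 16-byte slot */
    float    effects[4];       /* gradient/outline params: its own 16-byte slot */
    uint32_t flags;            /* low byte = Shape_Kind, upper bits = Shape_Flags */
    uint32_t _pad[3];          /* pad to a 16-byte multiple for std430 arrays */
} Core_2D_Primitive;

_Static_assert(sizeof(Core_2D_Primitive) == 96,
               "CPU struct must match the 96-byte GPU-side stride");
```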
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `core_2d.odin`). The texture UV rect
(`uv_rect: [4]f32`) and the gradient/outline parameters (`effects: Gradient_Outline`) live in their
own 16-byte slots in `Core_2D_Primitive`, so a primitive can carry texture and outline simultaneously.
Gradient and texture remain mutually exclusive at the fill-source level (a Brush variant chooses one
or the other) since they share the worst-case fragment-shader register path.
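A minimal sketch of the flags encoding described above. The bit positions of the upper flag bits are hypothetical (the real `Shape_Flags` values live in `core_2d.odin`); only the kind-in-low-byte split is from the text. Note that texture and outline can coexist on one primitive, per the paragraph above:

```c
#include <stdint.h>

enum Shape_Kind {
    KIND_SOLID = 0, KIND_RRECT = 1, KIND_NGON = 2,
    KIND_ELLIPSE = 3, KIND_RING_ARC = 4,
};

/* Hypothetical Shape_Flags bit positions -- illustration only. */
enum {
    FLAG_GRADIENT_LINEAR = 1u << 8,
    FLAG_GRADIENT_RADIAL = 1u << 9,
    FLAG_TEXTURED        = 1u << 10,
    FLAG_OUTLINE         = 1u << 11,
};

/* Pack the kind into the low byte, Shape_Flags into the upper bits. */
static uint32_t pack_flags(enum Shape_Kind kind, uint32_t shape_flags) {
    return (shape_flags & ~0xFFu) | ((uint32_t)kind & 0xFFu);
}

static enum Shape_Kind unpack_kind(uint32_t flags) {
    return (enum Shape_Kind)(flags & 0xFFu);
}
```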
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep`:
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (96 bytes)
instead of ~250 vertices (~5000 bytes).
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context).
The fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_rect`/`f_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
MSAA is intentionally not supported. SDF text and shapes compute fragment coverage analytically
via `smoothstep`, so they don't benefit from multisampling. Tessellated user geometry submitted via
`prepare_shape` is rendered without anti-aliasing — if AA is required for tessellated content, the
caller must render it to their own offscreen target and submit the result as a texture. This
decision matches RAD Debugger's architecture and aligns with the SBC target (Mali Valhall, where
MSAA's per-tile bandwidth multiplier is expensive).
## 2D rendering pipeline plan
@@ -66,22 +72,23 @@ primitives and effects can be added to the library without architectural changes
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
**≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
occupancy at the first cliff is accepted by design). Each effects primitive includes the base
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
low-end hardware (see register analysis below).
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
Valhall). Separated from the other pipelines because it structurally requires ending the current
render pass and copying the render target before any backdrop-sampling fragment can execute — a
command-buffer-level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
@@ -97,56 +104,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
Each GPU architecture has discrete **occupancy cliffs**: register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, throughput drops sharply.
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:
- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
**Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:
| Pipeline | Cliff targeted | Margin | Register budget | Rationale |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline | 32 (Valhall 1st cliff) | 8 | **≤24 regs** | Handles 90%+ of frame fragments; must run at full occupancy |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8 | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy |
| Effects pipeline | 64 (Valhall 2nd cliff) | 8 | **≤56 regs** | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
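The budget table can be read as a step function. The sketch below is a simplified model of the Valhall cliff numbers stated above (not vendor data), useful for eyeballing what a register-count change costs:

```c
/* Step-function model of Mali Valhall occupancy, per the cliffs in the text:
   first cliff at 32 registers, second at 64. The value past the second cliff
   is a modeling assumption, not a vendor figure. */
static int valhall_occupancy_pct(int regs) {
    if (regs <= 32) return 100;  /* below the first cliff: full thread count */
    if (regs <= 64) return 50;   /* between cliffs: thread count halves */
    return 25;                   /* beyond the second cliff (model only) */
}
```

Under this model the ≤24-register main budget and a 32-register shader run identically, while one register over the cliff (33) halves throughput — the asymmetry the margin discussion below relies on.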
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:
1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` demoted to `highp` (by bug fix, compiler heuristic change,
or a contributor not knowing) costs 2 registers per affected `vec4`.
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because the
compiler allocates registers for the worst-case path.
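The fragment count behind the ~4ms figure is checkable arithmetic (all inputs from the text):

```c
/* A 4K frame with 1.5x overdraw, per the text: ~12.4M fragments. */
static long overdrawn_fragments(void) {
    return 3840L * 2160L * 3L / 2L;  /* = 12,441,600 */
}

/* Halved occupancy roughly doubles fragment-bound cost: ~2ms -> ~4ms. */
static double unified_shader_cost_ms(double split_cost_ms) {
    return split_cost_ms * 2.0;
}
```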
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture-copy out of the main render pass.
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
#### Known limitations: V3D and Bifrost (16-register cliff)
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
reduced occupancy regardless of which shape kind or effect is active.
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).
#### Verifying register counts
The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
measured `malioc` numbers is a follow-up task.
#### Backdrop split: render-pass structure
@@ -156,10 +220,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
register-light (~1540 regs each), so merging them into the effects pipeline would cause no occupancy
problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
@@ -188,9 +253,9 @@ API where each layer draws shadows before quads before glyphs. Our design avoids
submission order is draw order, no layer juggling required.
**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (blend
modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per additional
axis with 7 pipelines vs 3.
**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
@@ -271,18 +336,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
#### Which category our branches fall into
Our design has three branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
cost.**
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function (~15–30
instructions per kind). Divergence occurs only at primitive boundaries where adjacent quads have
different kinds.
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
    solid color selection and outline rendering — all lightweight branches (3–8 instructions per
path). Divergence at primitive boundaries between different flag combinations has negligible cost.
For category 3, the divergence analysis depends on primitive size:
@@ -299,11 +369,12 @@ For category 3, the divergence analysis depends on primitive size:
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the cost
of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
@@ -327,10 +398,10 @@ our design:
> have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
divergent warps pay double: a large cost. Our SDF kind branches are short (~15–30 instructions
each), and the gradient/texture/solid color selection branches are shorter still (3–8 instructions
each). Even fully divergent, the combined penalty is ~30–60 extra instructions — comparable to a
single texture sample's latency.
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
@@ -338,9 +409,10 @@ our design:
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
split heavy effects into separate pipelines. Within the main pipeline, the four
   SDF kind branches and flag-based color selection cluster at ~22–26 registers (see register
analysis in Current state), within the ≤24-register budget that guarantees full occupancy on
Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
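The divergence cost cited in point 2 above can be verified with back-of-envelope arithmetic (all figures from the text; the 8.3ms budget is a 120 FPS frame):

```c
/* 3,000 divergent warps, each paying ~60 extra instructions, on a GPU
   executing ~12G instructions/sec -- the worst case from the text. */
static double divergence_overhead_us(void) {
    double extra_instr = 3000.0 * 60.0;  /* 180,000 extra instructions */
    return extra_instr / 12e9 * 1e6;     /* -> ~15 microseconds */
}

/* Fraction of an 8.3ms (120 FPS) frame budget. */
static double frame_fraction(void) {
    return divergence_overhead_us() / 8300.0;
}
```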
**References:**
@@ -361,27 +433,29 @@ our design:
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms_2D` push constant
(`Core_2D_Mode.Tessellated = 0`, `Core_2D_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
(SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of
`Core_2D_Primitive` structs, drawn instanced. Used for all shapes with closed-form signed distance
functions.
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
`Core_2D_Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture
sample and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
#### Why SDF for shapes
CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:
1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
= 5 KB. An SDF rounded rectangle is one `Core_2D_Primitive` struct (96 bytes) plus 4 shared
unit-quad vertices. That is roughly a 50× reduction per shape.
2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
@@ -412,60 +486,55 @@ SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceInd
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.
Each SDF shape is described by a single `Core_2D_Primitive` struct (96 bytes) in the storage
buffer. The vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position
from the unit vertex and the primitive's bounds, and passes shape parameters to the fragment shader
via `flat` interpolated varyings.
Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach), storage-
buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded rectangle costs
96 bytes instead of 4 vertices × 60+ bytes = 240+ bytes.
The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
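The per-shape byte counts from the last few paragraphs, collected in one place as checkable constants (all figures from the text; the 60-byte fat vertex is the cited worst case):

```c
/* Per-rounded-rectangle upload cost under each submission strategy. */
enum {
    TESSELLATED_RRECT_BYTES = 250 * 20,  /* ~250 vertices x 20 bytes/vertex */
    SDF_RRECT_BYTES         = 96,        /* one Core_2D_Primitive */
    FAT_VERTEX_RRECT_BYTES  = 4 * 60,    /* 4 quad corners x 60+ bytes each */
};
```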
#### Shape kinds and SDF dispatch
The fragment shader dispatches on `Shape_Kind` (low byte of `Core_2D_Primitive.flags`) to evaluate
one of four signed distance functions. The `Shape_Kind` enum, per-kind `*_Params` structs, and
CPU-side drawing procs all live in `core_2d.odin`. The drawing procs build the appropriate
`Core_2D_Primitive` and set the kind automatically:
Each user-facing shape proc accepts a `Brush` union (color, linear gradient, radial gradient,
or textured fill) as its fill source, plus optional outline parameters. The procs map to SDF
kinds as follows:
| User-facing proc | Shape_Kind | SDF function | Notes |
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
| `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
| `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
| `ellipse` | `Ellipse` | `sdEllipseApprox` | Approximate ellipse SDF (fast, suitable for UI) |
| `polygon` | `NGon` | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle |
| `ring` (full) | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping |
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask |
| `ring` (pie slice) | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |
The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
arc geometry). See the `Shape_Flag` enum in `core_2d.odin` for the authoritative flag
definitions and bit assignments.
**What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `tess.pixel` (single-pixel points)
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape`
The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
### Effects pipeline
### Backdrop pipeline
The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
refraction, mirror surfaces. It is separated from the main and effects pipelines for a structural
reason, not register pressure.
**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
must be in a sampler-readable state. A draw call that samples the render target it is also writing
to is a hard GPU constraint; the only way to satisfy it is to end the current render pass and start
a new one. That render-pass boundary is what a “bracket” is.
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used
by iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
require, keeping each sub-pass well under the 32-register cliff.
**Render-target choice.** When any layer in the frame contains a backdrop draw, the entire
frame renders into `source_texture` (a full-resolution single-sample texture owned by the
backdrop pipeline) instead of directly into the swapchain. At the end of the frame,
`source_texture` is copied to the swapchain via a single `CopyGPUTextureToTexture` call.
This means the bracket has no mid-frame texture copy: by the time the bracket runs,
`source_texture` already contains the pre-bracket frame contents and is the natural sampler
input. When no layer in the frame has a backdrop draw, the existing fast path runs: the frame
renders directly to the swapchain and the backdrop pipeline's working textures are never
touched. Zero cost for backdrop-free frames.
**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
bracket would have negligible impact on frame time.
#### Bracket scheduling model
The bracket is scheduled per layer, anchored at the first backdrop sub-batch in the layer's
submission order. Concretely, a layer with one or more backdrops splits into three groups:
1. **Pass A (pre-bracket)** — every non-backdrop sub-batch with index `< bracket_start_index`.
Renders to `source_texture` in a single render pass.
2. **The bracket** — every backdrop sub-batch in the layer (regardless of index). Runs one
downsample pass, then one (H-blur + V-composite) pass pair per unique sigma.
3. **Pass B (post-bracket)** — every non-backdrop sub-batch with index `>= bracket_start_index`.
Renders to `source_texture` with `LOAD`, drawing on top of the composited backdrop output.
`bracket_start_index` is the absolute index of the first `.Backdrop` kind in the layer's sub-batch
range. If the layer has no backdrops, none of this kicks in and the layer renders in a single render
pass via the existing fast path.
**Per-sigma-group execution.** The bracket walks each layer's sub-batches and groups contiguous
`.Backdrop` sub-batches that share a sigma; each group picks its own downsample factor (1, 2, or 4)
based on `compute_backdrop_downsample_factor`. For each group it runs four sub-passes: a downsample
from `source_texture` to `downsample_texture`; an H-blur from `downsample_texture` to
`h_blur_texture`; a V-blur from `h_blur_texture` back into `downsample_texture` (ping-pong reuse);
and finally a composite that reads the fully-blurred `downsample_texture`, applies the SDF mask
and tint, and writes the result to `source_texture`. Sub-batch coalescing in
`append_or_extend_sub_batch` merges contiguous same-sigma backdrops into a single instanced
composite draw; non-contiguous same-sigma backdrops still share the blur output but issue separate
composite draws.
The working textures are sized at the full swapchain resolution; larger downsample factors only
fill a sub-rect via viewport-limited rendering (see the comment block at the top of `backdrop.odin`
for the factor-selection table and rationale).
#### Submission-order trade-off
Within Pass A and Pass B, sub-batches render in the user's submission order. What the bracket model
sacrifices is _interleaved_ ordering between backdrop and non-backdrop content within a single
layer. A non-backdrop sub-batch submitted between two backdrops still renders in Pass B (after the
bracket), not at its submission position. Worked example:
```
draw.rectangle(layer, bg, GRAY) // 0 Tessellated → Pass A
draw.rectangle(layer, card_blue, BLUE) // 1 SDF → Pass A
draw.gaussian_blur(layer, panelA, sigma=12) // 2 Backdrop → Bracket (sees: bg + blue card)
draw.rectangle(layer, card_red, RED) // 3 SDF → Pass B (drawn ON TOP of panelA)
draw.gaussian_blur(layer, panelB, sigma=12) // 4 Backdrop → Bracket (sees: bg + blue card; same as panelA)
draw.text(layer, "label", ...) // 5 Text → Pass B (drawn ON TOP of both panels)
```
In this layer, panelB does _not_ see card_red — even though card_red was submitted before panelB —
because both backdrops sample `source_texture` as it stood at the bracket entry, which is after
Pass A and before card_red has rendered. card_red ends up on top of panelA, not underneath it.
The user controls the alternative outcome by splitting layers. Putting card_red and panelB into a
new layer (via `draw.new_layer`) gives panelB a fresh source snapshot that includes panelA and
card_red:
```
base := draw.begin(...)
draw.rectangle(base, bg, GRAY)
draw.rectangle(base, card_blue, BLUE)
draw.gaussian_blur(base, panelA, sigma=12) // panelA in base layer's bracket
top := draw.new_layer(base, ...)
draw.rectangle(top, card_red, RED)
draw.gaussian_blur(top, panelB, sigma=12) // top layer's bracket; sees base + card_red
draw.text(top, "label", ...)
```
Why one bracket per layer and not one per backdrop? Each bracket adds three render passes
(downsample + H-blur + V-composite) and at least three tile-cache flushes on tilers like Mali
Valhall. Strict submission-order semantics would require one bracket per cluster of contiguous
backdrops, which scales the GPU cost linearly with how interleaved the user's submission happens
to be — a footgun. The current design caps the bracket cost per layer regardless of submission
interleave, and gives the user explicit control over ordering through the existing layer
abstraction. This matches the cost/complexity envelope of iOS `UIVisualEffectView` and CSS
`backdrop-filter` (both of which constrain backdrop ordering implicitly).
### Vertex layout
The vertex struct is unchanged from the current 20-byte layout:
```
Vertex_2D :: struct {
position: [2]f32, // 0: screen-space position
uv: [2]f32, // 8: atlas UV (text) or unused (shapes)
color: Color, // 16: u8x4, GPU-normalized to float
}
```

For tessellated draws, `position` carries actual world-space geometry. For SDF draws, `position` holds unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.
The `Core_2D_Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:
```
Core_2D_Primitive :: struct {
bounds: [4]f32, // 0: min_x, min_y, max_x, max_y
color: Color, // 16: u8x4, unpacked in shader via unpackUnorm4x8
flags: u32, // 20: low byte = Shape_Kind, bits 8+ = Shape_Flags
rotation_sc: u32, // 24: packed f16 pair (sin, cos). Requires .Rotated flag.
_pad: f32, // 28: reserved for future use
params: Shape_Params, // 32: per-kind params union (half_feather, radii, etc.) (32 bytes)
uv_rect: [4]f32, // 64: texture UV coordinates. Read when .Textured.
effects: Gradient_Outline, // 80: gradient and/or outline parameters (16 bytes).
}
// Total: 96 bytes (std430 aligned)
```
`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `core_2d.odin`. Each SDF kind
writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
`Gradient_Outline` is a 16-byte struct containing `gradient_color: Color`, `outline_color: Color`,
`gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32` (packed f16 outline
width). It is independent of `uv_rect`, so a primitive can carry texture and outline parameters at
the same time. The `flags` field encodes the `Shape_Kind` in the low byte and `Shape_Flags` in bits
8+ via `pack_kind_flags`.
### Draw submission order
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. Text rendering is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `mode = 0`, sampling the SDL_ttf atlas texture.
MSDF (multi-channel signed distance field) text rendering may be evaluated later, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency.
This is explicitly deferred.
**References:**
### Textures
Textures plug into the existing main pipeline — no additional GPU pipeline, no shader rewrite. The
work is a resource layer (registration, upload, sampling, lifecycle) plus a `Texture_Fill` Brush
variant that routes the existing shape procs through the SDF path with the `.Textured` flag set.
#### Why draw owns registered textures
#### Textured draw procs
Textures share the same shape procs as colors and gradients. Each shape proc takes a `Brush`
union as its fill source; passing a `Texture_Fill` value (carrying `Texture_Id`, `tint`,
`uv_rect`, and `Sampler_Preset`) routes the draw through the SDF path with the `.Textured`
flag set. There is no dedicated `rectangle_texture` / `circle_texture` proc — the same
`rectangle`, `circle`, `ellipse`, `polygon`, `ring`, `line`, and `line_strip` procs handle
all fill sources.
A separate tessellated proc for "simple" fullscreen quads was considered on the theory that
the tessellated path's lower register count would improve occupancy at large fragment counts.
Both paths are well within the ≤24-register main pipeline budget — both run at full
occupancy on every target architecture (Valhall and above). The remaining ALU difference
(~15 extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise.
Meanwhile, splitting into a separate pipeline would add ~15μs per pipeline bind on the CPU
side per scissor, matching or exceeding the GPU-side savings. Within the main pipeline,
unified remains strictly better.
SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `circle`,
`ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients, textures, and outlines are
selected via the `Brush` union and optional outline parameters rather than separate overloads.
#### What SDF anti-aliasing does and does not do for textured draws
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
depends on how closely the display size matches the SDL_ttf atlas's rasterized size.
#### Fit modes are a computation layer, not a renderer concept
Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV
sub-region computations on top of the `uv_rect` field of `Texture_Fill`. The renderer has no
knowledge of fit modes — it samples whatever UV region it is given.
A `fit_params` helper computes the appropriate `uv_rect`, sampler preset, and (for letterbox/fit
mode) shrunken inner rect from a `Fit_Mode` enum, the target rect, and the texture's pixel size.
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: `fit_params` computes UVs from the fit mode, then `rectangle` is
called with a `Texture_Fill` brush and the appropriate radii (zero for sharp corners, per-corner
values from Clay's `cornerRadius` otherwise).
#### Deferred features
The following are plumbed in `Texture_Desc` but not yet implemented:
- **Mipmaps**: `Texture_Desc.mip_levels` field exists; generation via SDL3 deferred.
- **Compressed formats**: `Texture_Desc.format` accepts BC/ASTC; upload path deferred.
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
**References:**