# draw

2D rendering library built on SDL3 GPU, providing a unified shape-drawing and text-rendering API
with Clay UI integration.

## Current state

The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:

- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
  (`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
  `tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
  shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.

- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
  reads `primitives[gl_InstanceIndex]` and computes world-space position from the unit quad corners
  plus the primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte
  of `Primitive.flags`) to evaluate one of four signed distance functions:
  - **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
    circles (uniform radii = half-size), and line segments / capsules (a rotated RRect with uniform
    radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
  - **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
  - **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
  - **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
    normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).

All SDF shapes support fills, outlines, solid colors, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture fills are
mutually exclusive.

All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
instead of ~250 vertices (~5,000 bytes).

The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).

MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.

## 2D rendering pipeline plan

This section documents the planned architecture for levlib's 2D rendering system. The design is
driven by three goals: **draw quality** (mathematically exact curves with perfect anti-aliasing),
**efficiency** (minimal vertex bandwidth, high GPU occupancy, low draw-call count), and
**extensibility** (new primitives and effects can be added to the library without architectural
changes).

### Overview: three pipelines

The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):

1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
   **≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
   in a typical frame.

2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
   occupancy at the first cliff is accepted by design). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
   low-end hardware (see register analysis below).

3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
   target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
   where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
   Valhall). Separated from the other pipelines because it structurally requires ending the current
   render pass and copying the render target before any backdrop-sampling fragment can execute — a
   command-buffer-level boundary that cannot be avoided regardless of shader complexity.

A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
overhead is negligible relative to an 8.3ms (120 FPS) frame budget.

### Why three pipelines, not one or seven

The natural question is whether we should use a single unified pipeline (fewer state changes,
simpler code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register
usage).

#### Main/effects split: register pressure

A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically, based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.

Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, and throughput drops sharply.

**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:

- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
  **Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
  registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
  same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
  Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
  below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
  the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff
  entirely. Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our
  pipelines.

**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:

| Pipeline            | Cliff targeted         | Margin | Register budget   | Rationale                                                                                     |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline       | 32 (Valhall 1st cliff) | 8      | **≤24 regs**      | Handles 90%+ of frame fragments; must run at full occupancy                                   |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8      | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy                    |
| Effects pipeline    | 64 (Valhall 2nd cliff) | 8      | **≤56 regs**      | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |

**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:

1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0, etc.) ship new register
   allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
   mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` demoted to `highp` (by a bug fix, a compiler heuristic
   change, or a contributor unaware of the convention) costs 2 registers per affected `vec4`.

Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.

**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.

For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw
(~12.4M fragments), a single unified shader containing the shadow branch would cost ~4ms instead of
~2ms on Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because
the compiler allocates registers for the worst-case path.

The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.

**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture copy out of the main render pass.

**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.

#### Known limitations: V3D and Bifrost (16-register cliff)

Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff —
even the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs
at reduced occupancy regardless of which shape kind or effect is active.

Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.

We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).

#### Verifying register counts

The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(the ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc`
reports exact register allocation, spilling, and occupancy for each Mali generation. On desktop,
Radeon GPU Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted
estimates with measured `malioc` numbers is a follow-up task.

#### Backdrop split: render-pass structure

The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the
current render target must be copied to a separate texture via `CopyGPUTextureToTexture` — a
command-buffer-level operation that requires ending the current render pass. This boundary exists
regardless of shader complexity and cannot be optimized away.

The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
budgeted at ≤24 registers each (same as the main pipeline), so merging them into the effects
pipeline would cause no occupancy problem. But the render-pass boundary makes merging structurally
impossible — effects draws happen inside the main render pass, while backdrop draws happen inside
their own bracketed pass sequence.

#### Why not per-primitive-type pipelines (GPUI's approach)

Zed's GPUI uses 7 separate shader pairs: quad, shadow, underline, monochrome sprite, polychrome
sprite, path, and surface. This eliminates all branching and gives each shader minimal register
usage. Three concrete costs make this approach wrong for our use case:

**Draw call count scales with kind variety, not just scissor count.** With a unified pipeline, one
instanced draw call per scissor covers all primitive kinds from a single storage buffer. With
per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a
typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
negligible (all members cluster at 12–22 registers).

**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
from one storage buffer, submission order equals draw order — Clay's painterly render commands flow
through without reordering. With separate pipelines per kind, primitives can only batch with
same-kind neighbors, which means interleaved kinds (e.g., `[rrect, circle, text, rrect, text]`) must
either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort
by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their
API, where each layer draws shadows before quads before glyphs. Our design avoids this constraint:
submission order is draw order, no layer juggling required.

**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms of cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
additional axis with 7 pipelines vs 3.

**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses.
Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved
in submission order:

- _Unified effects pipeline (our plan):_ 1 pipeline bind, 1 instanced draw call. Category-3
  divergence occurs at drop-shadow/inner-shadow boundaries, where ~4 warps straddle each boundary ×
  2 boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra
  instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ **1.7μs**.

- _Per-kind effects pipelines (GPUI-style):_ 2 pipeline binds + 2 draw calls. But the submission
  order is `[drop, drop, inner, drop, drop, inner, drop, ...]` — the two inner-shadow primitives
  split the drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and
  4 pipeline switches, not 2. Cost: 5 × 5μs + 4 × 5μs = **45μs**.

The per-kind approach costs **26× more** than the unified approach's divergence penalty (45μs vs
1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most
extreme stacked-effects scenario (10 cards each with both a drop shadow and an inner shadow,
producing ~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs —
still 3.5× cheaper than the pipeline-switching alternative.

The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
and structural render-pass requirements (see the analysis above). Within a pipeline, unified is
strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
negligible GPU-side branching cost.

**References:**

- Zed GPUI blog post on their per-primitive pipeline architecture:
  https://zed.dev/blog/videogame
- Zed GPUI Metal shader source (7 shader pairs):
  https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
  https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/
- NVIDIA Ampere GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.6, 64 for
  cc 8.0), register file size (64K), occupancy factors:
  https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
- NVIDIA Ada GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.9):
  https://docs.nvidia.com/cuda/ada-tuning-guide/index.html
- CUDA Occupancy Calculation walkthrough (register allocation granularity, worked examples):
  https://leimao.github.io/blog/CUDA-Occupancy-Calculation/
- Apple M3 GPU architecture — Dynamic Caching (register file virtualization) eliminates static
  worst-case register allocation, reducing the occupancy penalty for high-register shaders:
  https://asplos.dev/wiki/m3-chip-explainer/gpu/index.html

### Why fragment shader branching is safe in this design

There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010
hardware, where shader cores had no branch instructions at all — compilers emitted code for both
sides of every branch and used conditional selects to pick the result. On modern GPUs (everything
from ~2012 onward), this is no longer the case: native dynamic branching is fully supported on all
current hardware. However, branching _can_ still be costly in specific circumstances. Understanding
which circumstances apply to our design — and which do not — is critical to justifying the
unified-pipeline approach.

#### How GPU branching works

GPUs execute fragment shaders in **warps** (NVIDIA/Intel, 32 threads) or **wavefronts** (AMD, 32 or
64 threads). All threads in a warp execute the same instruction simultaneously (the SIMT model).
When a branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to
the taken path and skips the other — **zero cost**, identical to a CPU branch. This is called a
**uniform branch** or **warp-coherent branch**.

When threads within the same warp disagree on which path to take, the warp must execute both paths
sequentially, masking off the threads that don't belong to the active path. This is called **warp
divergence**, and it causes the warp to pay the cost of both sides of the branch. In the worst case
(a 50/50 split), throughput halves for that warp.

There are four categories of branch condition in a fragment shader, ranked by cost:

| Category                         | Condition source                                                  | GPU behavior                                                                                   | Cost                  |
| -------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------- |
| **Compile-time constant**        | `#ifdef`, `const bool`                                            | Dead code eliminated by compiler                                                               | Zero                  |
| **Uniform / push constant**      | Same value for entire draw call                                   | Warp-coherent; GPU skips dead path                                                             | Effectively zero      |
| **Per-primitive `flat` varying** | Same value across all fragments of a primitive                    | Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries | Near zero (see below) |
| **Per-fragment varying**         | Different value per pixel (e.g., texture lookup, screen position) | Potentially divergent within every warp                                                        | Can be expensive      |

#### Which category our branches fall into

Our design has three branch points:

1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**

2. **`kind` (flat varying from storage buffer): SDF shape-kind dispatch.** This is category 3. The
   low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed to
   the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
   kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function
   (~15–30 instructions per kind). Divergence occurs only at primitive boundaries where adjacent
   quads have different kinds.

3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
   The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
   solid-color selection and outline rendering — all lightweight branches (3–8 instructions per
   path). Divergence at primitive boundaries between different flag combinations has negligible
   cost.

For category 3, the divergence analysis depends on primitive size:

- **Large primitives** (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect
  produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a
  different kind. Divergence rate: **0.6%** of warps.

- **Small primitives** (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps
  diverge. Divergence rate: **25%** of warps for that primitive, but the primitive itself covers a
  tiny fraction of the frame's total fragments.

- **Worst realistic case**: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect
  icons). Even here, the interior warps of each primitive are coherent; only the edges diverge.
  Total frame-level divergence is typically **1–3%** of all warps.

At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the
cost of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:

- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
  specifically because CPU-side tessellation was the bottleneck, not fragment branching:
  https://github.com/audulus/vger-rs

- **Randy Gaul's 2D renderer**: single pipeline with `shape_type` encoded as a vertex attribute.
  Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because
  "games tend to draw a lot of the same shape type":
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

#### What kind of branching IS expensive

For completeness, here are the cases where shader branching genuinely hurts — none of which apply
to our design:

1. **Per-fragment data-dependent branches with high divergence.** Example:
   `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp
   has ~50% divergence, and every warp pays for both paths. This is the scenario the "branches are
   bad" folklore warns about. We have no per-fragment data-dependent branches in the main pipeline.

2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay double a large cost. Our SDF kind branches are short (~15–30 instructions
   each), and the gradient/texture/solid-color selection branches are shorter still (3–8
   instructions each). Even fully divergent, the combined penalty is ~30–60 extra instructions —
   comparable to a single texture sample's latency.

3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
   Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a
   concern.

4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects into separate pipelines. Within the main pipeline, the four SDF kind branches
   and flag-based color selection cluster at ~22–26 registers (see the register analysis in Current
   state), within the ≤24-register budget that guarantees full occupancy on Valhall and all desktop
   architectures. See Known limitations for V3D / Bifrost.

**References:**

- ARM solidpixel blog on branches in mobile shaders — comprehensive taxonomy of branch execution
  models across GPU generations; confirms uniform and warp-coherent branches are free on modern
  hardware:
  https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
- Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that
  warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with a clear explanation of the
  SIMT divergence mechanism:
  https://www.peterstefek.me/shader-branch.html
- NVIDIA Volta architecture whitepaper — documents independent thread scheduling, which allows
  divergent threads to reconverge more efficiently than on older architectures:
  https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Randy Gaul on warp divergence in practice with per-primitive `shape_type` branching:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

### Main pipeline: SDF + tessellated (unified)

The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.

- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
  (SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
  user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.

Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
|
||
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
|
||
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
|
||
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.

#### Why SDF for shapes

CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:

1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
   = 5 KB. An SDF rounded rectangle is one `Primitive` struct (80 bytes) plus six shared unit-quad
   vertices. That is roughly a 60× reduction per shape.

2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
   animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
   mathematically exact boundaries with perfect anti-aliasing via `smoothstep` in the fragment
   shader.

3. **Feature cost.** Adding soft edges, outlines, stroke effects, or rounded-cap line segments
   requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader
   operations: `abs(d) - thickness` for stroke, `smoothstep(-soft, soft, d)` for soft edges.

**References:**

- Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used):
  https://iquilezles.org/articles/distfunctions2d/
- Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
- Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout,
  warp divergence tradeoffs, and polyline rendering:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
- Audulus vger-rs — production 2D renderer using a single unified pipeline with an SDF type
  discriminant, the same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg
  fell to 30 FPS due to CPU-side tessellation:
  https://github.com/audulus/vger-rs

#### Storage-buffer instancing for SDF primitives

SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceIndex` in the vertex
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.

Each SDF shape is described by a single `Primitive` struct (80 bytes) in the storage buffer. The
vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position from the unit
vertex and the primitive's bounds, and passes shape parameters to the fragment shader via `flat`
interpolated varyings.

Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach),
storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded
rectangle costs 80 bytes instead of 4 vertices × 40+ bytes = 160+ bytes.

The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.

#### Shape kinds and SDF dispatch

The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
`pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
and set the kind automatically:

| User-facing proc     | Shape_Kind | SDF function       | Notes                                                      |
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| `rectangle`          | `RRect`    | `sdRoundedBox`     | Per-corner radii from `radii` param                        |
| `rectangle_texture`  | `RRect`    | `sdRoundedBox`     | Textured fill; `.Textured` flag set                        |
| `circle`             | `RRect`    | `sdRoundedBox`     | Uniform radii = half-size (circle is a degenerate RRect)   |
| `line`, `line_strip` | `RRect`    | `sdRoundedBox`     | Rotated capsule — stadium shape (radii = half-thickness)   |
| `ellipse`            | `Ellipse`  | `sdEllipseApprox`  | Approximate ellipse SDF (fast, suitable for UI)            |
| `polygon`            | `NGon`     | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle              |
| `ring` (full)        | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping       |
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask           |
| `ring` (pie slice)   | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |

The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture,
rotation, arc geometry). See the `Shape_Flag` enum in `pipeline_2d_base.odin` for the authoritative
flag definitions and bit assignments.

**What stays tessellated:**

- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `tess.pixel` (single-pixel points)
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape`

The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.

### Effects pipeline

The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and
similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a
dedicated pipeline state object that has its own compiled fragment shader.

#### Combined shape + effect rendering

When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn
**once**, entirely in the effects pipeline. The effects fragment shader evaluates both the effect
(blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated
across pipelines.

This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5)
and blur sigma 10:

- **Separate-primitive approach** (shape in main pipeline + shadow in effects pipeline): the shadow
  quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500
  shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the
  shape. Total fragment invocations: ~49,900.

- **Combined approach** (one primitive in effects pipeline): one quad covers ~230×130 = 29,900
  pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, and composites the
  shape on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the
  blur+shape shader, but the shape SDF evaluation adds only ~15 ops to an ~80-op blur shader.

The combined approach uses **~40% fewer fragment invocations** per effected shape (29,900 vs 49,900)
in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF
evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments
through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10
shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.

An `Effect_Flag.Draw_Base_Shape` flag controls whether the sharp shape layer composites on top
(default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with
no shape on top) clear this flag.

Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects
are routed to the effects pipeline.

#### Drop shadow implementation

Drop shadows use the analytical blurred-rounded-rectangle technique. Raph Levien's 2020 blog post
describes an erf-based approximation that computes a Gaussian-blurred rounded rectangle in closed
form along one axis, with a 4-sample numerical integration along the other. Total fragment cost is
~80 FLOPs, one sqrt, and no texture samples. This is the same technique used by Zed GPUI (via Evan
Wallace's variant) and vger-rs.

**References:**

- Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement):
  https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
- Evan Wallace's original WebGL implementation (used by Figma):
  https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
- Vello's implementation of the blurred rounded rectangle as a gradient type:
  https://github.com/linebender/vello/pull/665

### Backdrop pipeline

The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
register pressure.

**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
while also writing to it.

**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
require, keeping each sub-pass well under the 32-register cliff.

**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
regardless of how many backdrop effects are visible. When no backdrop effects are present, the
bracket is never entered and the texture copy never happens — zero cost.

**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
bracket would have negligible impact on frame time.

### Vertex layout

The vertex struct is unchanged from the current 20-byte layout:

```
Vertex :: struct {
    position: [2]f32, // 0:  screen-space position
    uv:       [2]f32, // 8:  atlas UV (text) or unused (shapes)
    color:    Color,  // 16: u8x4, GPU-normalized to float
}
```

This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated
draws, `position` carries actual world-space geometry. For SDF draws, `position` carries unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.

The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:

```
Primitive :: struct {
    bounds:      [4]f32,        // 0:  min_x, min_y, max_x, max_y
    color:       Color,         // 16: u8x4, unpacked in shader via unpackUnorm4x8
    flags:       u32,           // 20: low byte = Shape_Kind, bits 8+ = Shape_Flags
    rotation_sc: u32,           // 24: packed f16 pair (sin, cos). Requires .Rotated flag.
    _pad:        f32,           // 28: reserved for future use
    params:      Shape_Params,  // 32: per-kind params union (half_feather, radii, etc.) (32 bytes)
    uv:          Uv_Or_Effects, // 64: texture UV rect or gradient/outline parameters (16 bytes)
}
// Total: 80 bytes (std430 aligned)
```

`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
writes its own params variant; the fragment shader reads the appropriate fields based on
`Shape_Kind`. `Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min,
v_min, u_max, v_max) with a `Gradient_Outline` struct containing `gradient_color: Color`,
`outline_color: Color`, `gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32`
(packed f16 outline width). The `flags` field encodes the `Shape_Kind` in the low byte and
`Shape_Flags` in bits 8+ via `pack_kind_flags`.

### Draw submission order

Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:

1. Bind **effects pipeline** → draw all queued effects primitives for this scissor (instanced, one
   draw call). Each effects primitive includes its base shape and composites internally.
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
   shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
   backdrop primitives.

The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
invariant is that each primitive is drawn exactly once, in the pipeline that owns it.

### Text rendering

Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per `(font, size)`
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. This path is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `mode = 0`, sampling the SDL_ttf atlas texture.

A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:

- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
  already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency.

This is explicitly deferred. The SDF shape migration is independent of, and does not block, text
changes.

**References:**

- Viktor Chlumský's MSDF master's thesis and msdfgen tool:
  https://github.com/Chlumsky/msdfgen
- MSDF atlas generator for font atlas packing:
  https://github.com/Chlumsky/msdf-atlas-gen
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf

### Textures

Textures plug into the existing main pipeline — no additional GPU pipeline, no shader rewrite. The
work is a resource layer (registration, upload, sampling, lifecycle) plus two textured-draw procs
that route into the existing tessellated and SDF paths respectively.

#### Why draw owns registered textures

A texture's GPU resource (the `^sdl.GPUTexture`, transfer buffer, shader resource view) is created
and destroyed by draw. The user provides raw bytes and a descriptor at registration time; draw
uploads synchronously and returns an opaque `Texture_Id` handle. The user can free their CPU-side
bytes immediately after `register_texture` returns.

This follows the model used by the RAD Debugger's render layer (`src/render/render_core.h` in
EpicGamesExt/raddebugger, MIT license), where `r_tex2d_alloc` takes `(kind, size, format, data)`
and returns an opaque handle that the renderer owns and releases. The single-owner model eliminates
an entire class of lifecycle bugs (double-free, use-after-free across subsystems, unclear cleanup
responsibility) that dual-ownership designs introduce.

If advanced interop is ever needed (e.g., a future 3D pipeline or compute shader sharing the same
GPU texture), the clean extension is a borrowed-reference accessor (`get_gpu_texture(id)`) that
returns the underlying handle without transferring ownership. This is purely additive and does not
require changing the registration API.

#### Why `Texture_Kind` exists

`Texture_Kind` (Static / Dynamic / Stream) is a driver hint for update frequency, adopted from the
RAD Debugger's `R_ResourceKind`. It maps directly to SDL3 GPU usage patterns:

- **Static**: uploaded once, never changes. Covers QR codes, decoded PNGs, icons — the 90% case.
- **Dynamic**: updatable via `update_texture_region`. Covers font atlas growth, procedural updates.
- **Stream**: frequent full re-uploads. Covers video playback, per-frame procedural generation.

This costs one byte in the descriptor and lets the backend pick optimal memory placement without a
future API change.

#### Why samplers are per-draw, not per-texture

A sampler describes how to filter and address a texture during sampling — nearest vs bilinear, clamp
vs repeat. This is a property of the _draw_, not the texture. The same QR code texture should be
sampled with `Nearest_Clamp` when displayed at native resolution but could reasonably be sampled
with `Linear_Clamp` in a zoomed-out thumbnail. The same icon atlas might be sampled with
`Nearest_Clamp` for pixel art or `Linear_Clamp` for smooth scaling.

The RAD Debugger follows this pattern: `R_BatchGroup2DParams` carries `tex_sample_kind` alongside
the texture handle, chosen per batch group at draw time. We do the same — `Sampler_Preset` is a
parameter on the draw procs, not a field on `Texture_Desc`.

Internally, draw keeps a small pool of pre-created `^sdl.GPUSampler` objects (one per preset,
lazily initialized). Sub-batch coalescing keys on `(kind, texture_id, sampler_preset)` — draws
with the same texture but different samplers produce separate draw calls, which is correct.

#### Textured draw procs

Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
`rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.

An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
quads, on the theory that the tessellated path's lower register count would improve occupancy at
large fragment counts. In practice, both paths are well within the ≤24-register main pipeline budget
— both run at full occupancy on every target architecture (Valhall and above). The remaining ALU
difference (~15 extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise.
Meanwhile, splitting into a separate pipeline would add ~1–5μs per pipeline bind on the CPU side per
scissor, matching or exceeding the GPU-side savings. Within the main pipeline, a unified path
remains strictly better.

SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
`circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
parameters on each proc rather than separate overloads. Future per-shape texture variants
(`circle_texture`, `ellipse_texture`) are additive.

#### What SDF anti-aliasing does and does not do for textured draws

The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
depends on how closely the display size matches the SDL_ttf atlas's rasterized size.

#### Fit modes are a computation layer, not a renderer concept

Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV
sub-region computations on top of the `uv_rect` parameter that both textured-draw procs accept. The
renderer has no knowledge of fit modes — it samples whatever UV region it is given.

A `fit_params` helper computes the appropriate `uv_rect`, sampler preset, and (for letterbox/fit
mode) shrunken inner rect from a `Fit_Mode` enum, the target rect, and the texture's pixel size.
Users who need custom UV control (sprite atlas sub-regions, UV animation, nine-patch slicing) skip
the helper and compute `uv_rect` directly. This keeps the renderer primitive minimal while making
the common cases convenient.

#### Deferred release

`unregister_texture` does not immediately release the GPU texture. It queues the slot for release at
the end of the current frame, after `SubmitGPUCommandBuffer` has handed work to the GPU. This
prevents a race condition where a texture is freed while the GPU is still sampling from it in an
already-submitted command buffer. The same deferred-release pattern is applied to `clear_text_cache`
and `clear_text_cache_entry`, fixing a pre-existing latent bug where destroying a cached
`^sdl_ttf.Text` mid-frame could free an atlas texture still referenced by in-flight draw batches.

This pattern is standard in production renderers — the RAD Debugger's `r_tex2d_release` queues
textures onto a free list that is processed in `r_end_frame`, not at the call site.

#### Clay integration

Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: `fit_params` computes UVs from the fit mode, then `rectangle_texture`
is called with the appropriate radii (zero for sharp corners, per-corner values from Clay's
`cornerRadius` otherwise).

#### Deferred features

The following are plumbed in the descriptor but not implemented in phase 1:

- **Mipmaps**: `Texture_Desc.mip_levels` field exists; generation via SDL3 deferred.
- **Compressed formats**: `Texture_Desc.format` accepts BC/ASTC; upload path deferred.
- **Render-to-texture**: `Texture_Desc.usage` accepts `.COLOR_TARGET`; render-pass refactor deferred.
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
- **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential
  future additions, following the existing naming convention.

**References:**

- RAD Debugger render layer (ownership model, deferred release, sampler-at-draw-time):
  https://github.com/EpicGamesExt/raddebugger — `src/render/render_core.h`,
  `src/render/d3d11/render_d3d11.c`
- Casey Muratori, Handmade Hero day 472 — texture handling as a renderer-owned resource concern,
  atlases as a separate layer above the renderer.

## 3D rendering

3D pipeline architecture is under consideration and will be documented separately. The current
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

## Multi-window support

The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.

RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
bucket, so different windows can have independent clip and transform state without interference.

The key properties this gives RAD:

- **Per-window isolation.** Each window builds its own bucket with its own pass list and state
  stacks. No global contention.
- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
  build buckets concurrently, then submit them to the render backend sequentially.
- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
  bucket with a transform applied, without rebuilding its draw calls.

For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
global (shared across windows), with only the per-frame staging buffers and layer/scissor state
becoming per-context.

## Building shaders

GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
are generated into `shaders/generated/` via the meta tool:

```
odin run meta -- gen-shaders
```

Requires `glslangValidator` and `spirv-cross` on PATH.

### Shader format selection

The library embeds shader bytecode per compile target — MSL + `main0` entry point on Darwin (via
`spirv-cross --msl`, which renames `main` because it is reserved in Metal), SPIR-V + `main` entry
point elsewhere. Three compile-time constants in `draw.odin` expose the build's shader configuration:

| Constant                      | Type                      | Darwin    | Other      |
| ----------------------------- | ------------------------- | --------- | ---------- |
| `PLATFORM_SHADER_FORMAT_FLAG` | `sdl.GPUShaderFormatFlag` | `.MSL`    | `.SPIRV`   |
| `PLATFORM_SHADER_FORMAT`      | `sdl.GPUShaderFormat`     | `{.MSL}`  | `{.SPIRV}` |
| `SHADER_ENTRY`                | `cstring`                 | `"main0"` | `"main"`   |

Pass `PLATFORM_SHADER_FORMAT` to `sdl.CreateGPUDevice` so SDL selects a backend compatible with the
embedded bytecode:

```
gpu := sdl.CreateGPUDevice(draw.PLATFORM_SHADER_FORMAT, true, nil)
```

At init time the library calls `sdl.GetGPUShaderFormats(device)` to verify that the active backend
accepts `PLATFORM_SHADER_FORMAT_FLAG`. If it does not, `draw.init` returns `false` with a
descriptive log message showing both the embedded and active format sets.
|