# draw

2D rendering library built on SDL3 GPU, providing a unified shape-drawing and text-rendering API
with Clay UI integration.

## Current state

The renderer uses a single unified `Pipeline_2D_Base` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:

- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
  SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
  (`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
  `tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
  shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.

- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
  `Primitive` structs (80 bytes each) uploaded each frame to a GPU storage buffer. The vertex shader
  reads `primitives[gl_InstanceIndex]` and computes world-space position from the unit quad corners
  plus the primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte
  of `Primitive.flags`) to evaluate one of four signed distance functions:
  - **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
    circles (uniform radii = half-size), and line segments / capsules (a rotated RRect with uniform
    radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
  - **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
  - **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
  - **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
    normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).

All SDF shapes support fills, outlines, solid colors, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `pipeline_2d_base.odin`). Gradient and outline
parameters are packed into the same 16 bytes as the texture UV rect via a `Uv_Or_Effects` raw union
— zero size increase to the 80-byte `Primitive` struct. Gradient/outline and texture fills are
mutually exclusive.

All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep` —
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (80 bytes)
instead of ~250 vertices (~5,000 bytes).

The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context). The
fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_or_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).

MSAA is opt-in (default `._1`, no MSAA) via `Init_Options.msaa_samples`. SDF rendering does not
benefit from MSAA because fragment coverage is computed analytically. MSAA remains useful for text
glyph edges and tessellated user geometry if desired.

## 2D rendering pipeline plan

This section documents the planned architecture for levlib's 2D rendering system. The design is
driven by three goals: **draw quality** (mathematically exact curves with perfect anti-aliasing),
**efficiency** (minimal vertex bandwidth, high GPU occupancy, low draw-call count), and
**extensibility** (new primitives and effects can be added to the library without architectural
changes).

### Overview: three pipelines

The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):

1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
   **≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
   in a typical frame.

2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
   effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
   occupancy at the first cliff is accepted by design). Each effects primitive includes the base
   shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
   redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
   low-end hardware (see register analysis below).

3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
   target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
   where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
   Valhall). Separated from the other pipelines because it structurally requires ending the current
   render pass and copying the render target before any backdrop-sampling fragment can execute — a
   command-buffer-level boundary that cannot be avoided regardless of shader complexity.

A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
switches plus 1 texture copy. At ~1–5μs per pipeline bind on modern APIs, worst-case switching
overhead is negligible relative to an 8.3ms (120 FPS) frame budget.

### Why three pipelines, not one or seven

The natural question is whether we should use a single unified pipeline (fewer state changes,
simpler code) or many per-primitive-type pipelines (no branching overhead, lean per-shader register
usage).

#### Main/effects split: register pressure

A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically, based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.

Each GPU architecture has discrete **occupancy cliffs** — register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, and throughput drops sharply.

**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:

- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
  **Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
  registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
  same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
  Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
  below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
  the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff
  entirely. Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our
  pipelines.

**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:

| Pipeline            | Cliff targeted         | Margin | Register budget   | Rationale                                                                                     |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline       | 32 (Valhall 1st cliff) | 8      | **≤24 regs**      | Handles 90%+ of frame fragments; must run at full occupancy                                   |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8      | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy                    |
| Effects pipeline    | 64 (Valhall 2nd cliff) | 8      | **≤56 regs**      | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |

**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:

1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0, etc.) ship new register
   allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
   mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` demoted to `highp` (by a bug fix, a compiler heuristic
   change, or a contributor unaware of the convention) costs 2 registers per affected `vec4`.

Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.

**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.

For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw
(~12.4M fragments), a single unified shader containing the shadow branch would cost ~4ms instead of
~2ms on Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because
the compiler allocates registers for the worst-case path.

The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.

**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture copy out of the main render pass.

**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.

#### Known limitations: V3D and Bifrost (16-register cliff)

Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff —
even the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs
at reduced occupancy regardless of which shape kind or effect is active.

Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.

We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).

#### Verifying register counts

The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(the ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc`
reports exact register allocation, spilling, and occupancy for each Mali generation. On desktop,
Radeon GPU Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted
estimates with measured `malioc` numbers is a follow-up task.

#### Backdrop split: render-pass structure

The backdrop pipeline (frosted glass, refraction, mirror surfaces) is separated for a structural
reason unrelated to register pressure. Before any backdrop-sampling fragment can execute, the
current render target must be copied to a separate texture via `CopyGPUTextureToTexture` — a
command-buffer-level operation that requires ending the current render pass. This boundary exists
regardless of shader complexity and cannot be optimized away.

The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
budgeted at ≤24 registers each (same as the main pipeline), so merging them into the effects
pipeline would cause no occupancy problem. But the render-pass boundary makes merging structurally
impossible — effects draws happen inside the main render pass, while backdrop draws happen inside
their own bracketed pass sequence.

#### Why not per-primitive-type pipelines (GPUI's approach)

Zed's GPUI uses 7 separate shader pairs: quad, shadow, underline, monochrome sprite, polychrome
sprite, path, and surface. This eliminates all branching and gives each shader minimal register
usage. Three concrete costs make this approach wrong for our use case:

**Draw call count scales with kind variety, not just scissor count.** With a unified pipeline, one
instanced draw call per scissor covers all primitive kinds from a single storage buffer. With
per-kind pipelines, each scissor requires one draw call and one pipeline bind per kind used. For a
typical UI frame with 15 scissors and 3–4 primitive kinds per scissor, per-kind splitting produces
~45–60 draw calls and pipeline binds; our unified approach produces ~15–20 draw calls and 1–5
pipeline binds. At ~5μs each for CPU-side command encoding on modern APIs, per-kind splitting adds
375–500μs of CPU overhead per frame — **4.5–6% of an 8.3ms (120 FPS) budget** — with no
compensating GPU-side benefit, because the register-pressure savings within the simple-SDF range are
negligible (all members cluster at 12–22 registers).

**Z-order preservation forces the API to expose layers.** With a single pipeline drawing all kinds
from one storage buffer, submission order equals draw order — Clay's painterly render commands flow
through without reordering. With separate pipelines per kind, primitives can only batch with
same-kind neighbors, which means interleaved kinds (e.g., `[rrect, circle, text, rrect, text]`) must
either issue one draw call per primitive (defeating batching entirely) or force the user to pre-sort
by kind and reason about explicit layers. GPUI chose the latter, baking layer semantics into their
API, where each layer draws shadows before quads before glyphs. Our design avoids this constraint:
submission order is draw order, no layer juggling required.

**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms of cold startup; 3 pipelines is ~75ms. Adding state axes (MSAA
variants, blend modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per
additional axis with 7 pipelines vs 3.

**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
branches (~80 instructions for drop shadow vs ~20 for an SDF). Even here, per-kind splitting loses.
Consider a worst-case scissor with 15 drop-shadowed cards and 2 inner-shadowed elements interleaved
in submission order:

- _Unified effects pipeline (our plan):_ 1 pipeline bind, 1 instanced draw call. Category-3
  divergence occurs at drop-shadow/inner-shadow boundaries, where ~4 warps straddle each boundary ×
  2 boundaries = ~8 divergent warps out of ~19,924 total (0.04%). Each divergent warp pays ~80 extra
  instructions. Total divergence cost: 8 × 32 × 80 / 12G inst/sec ≈ **1.7μs**.

- _Per-kind effects pipelines (GPUI-style):_ 2 pipeline binds + 2 draw calls. But the submission
  order is `[drop, drop, inner, drop, drop, inner, drop, ...]` — the two inner-shadow primitives
  split the drop-shadow run into three segments. To preserve Z-order, this requires 5 draw calls and
  4 pipeline switches, not 2. Cost: 5 × 5μs + 4 × 5μs = **45μs**.

The per-kind approach costs **26× more** than the unified approach's divergence penalty (45μs vs
1.7μs), while eliminating only 0.04% warp divergence that was already negligible. Even in the most
extreme stacked-effects scenario (10 cards each with both a drop shadow and an inner shadow,
producing ~60 boundary warps at ~80 extra instructions each), unified divergence costs ~13μs —
still 3.5× cheaper than the pipeline-switching alternative.

The split we _do_ perform (main / effects / backdrop) is motivated by register-pressure boundaries
and structural render-pass requirements (see the analysis above). Within a pipeline, unified is
strictly better by every measure: fewer draw calls, simpler Z-order, lower CPU overhead, and
negligible GPU-side branching cost.

**References:**

- Zed GPUI blog post on their per-primitive pipeline architecture:
  https://zed.dev/blog/videogame
- Zed GPUI Metal shader source (7 shader pairs):
  https://github.com/zed-industries/zed/blob/cb6fc11/crates/gpui/src/platform/mac/shaders.metal
- NVIDIA Nsight Graphics 2024.3 documentation on active-threads-per-warp and divergence analysis:
  https://developer.nvidia.com/blog/optimize-gpu-workloads-for-graphics-applications-with-nvidia-nsight-graphics/
- NVIDIA Ampere GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.6, 64 for
  cc 8.0), register file size (64K), occupancy factors:
  https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html
- NVIDIA Ada GPU Architecture Tuning Guide — SM specs, max warps per SM (48 for cc 8.9):
  https://docs.nvidia.com/cuda/ada-tuning-guide/index.html
- CUDA Occupancy Calculation walkthrough (register allocation granularity, worked examples):
  https://leimao.github.io/blog/CUDA-Occupancy-Calculation/
- Apple M3 GPU architecture — Dynamic Caching (register file virtualization) eliminates static
  worst-case register allocation, reducing the occupancy penalty for high-register shaders:
  https://asplos.dev/wiki/m3-chip-explainer/gpu/index.html

### Why fragment shader branching is safe in this design

There is longstanding folklore that "branches in shaders are bad." This was true on pre-2010
hardware, where shader cores had no branch instructions at all — compilers emitted code for both
sides of every branch and used conditional selects to pick the result. On modern GPUs (everything
from ~2012 onward), this is no longer the case: native dynamic branching is fully supported on all
current hardware. However, branching _can_ still be costly in specific circumstances. Understanding
which circumstances apply to our design — and which do not — is critical to justifying the
unified-pipeline approach.

#### How GPU branching works

GPUs execute fragment shaders in **warps** (NVIDIA/Intel, 32 threads) or **wavefronts** (AMD, 32 or
64 threads). All threads in a warp execute the same instruction simultaneously (the SIMT model).
When a branch condition evaluates the same way for every thread in a warp, the GPU simply jumps to
the taken path and skips the other — **zero cost**, identical to a CPU branch. This is called a
**uniform branch** or **warp-coherent branch**.

When threads within the same warp disagree on which path to take, the warp must execute both paths
sequentially, masking off the threads that don't belong to the active path. This is called **warp
divergence**, and it causes the warp to pay the cost of both sides of the branch. In the worst case
(a 50/50 split), throughput halves for that warp.

There are four categories of branch condition in a fragment shader, ranked by cost:

| Category                         | Condition source                                                  | GPU behavior                                                                                   | Cost                  |
| -------------------------------- | ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------- |
| **Compile-time constant**        | `#ifdef`, `const bool`                                            | Dead code eliminated by compiler                                                               | Zero                  |
| **Uniform / push constant**      | Same value for entire draw call                                   | Warp-coherent; GPU skips dead path                                                             | Effectively zero      |
| **Per-primitive `flat` varying** | Same value across all fragments of a primitive                    | Warp-coherent for all warps fully inside one primitive; divergent only at primitive boundaries | Near zero (see below) |
| **Per-fragment varying**         | Different value per pixel (e.g., texture lookup, screen position) | Potentially divergent within every warp                                                        | Can be expensive      |

#### Which category our branches fall into

Our design has three branch points:

1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
   Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
   cost.**

2. **`kind` (flat varying from storage buffer): SDF shape-kind dispatch.** This is category 3. The
   low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed to
   the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
   kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function
   (~15–30 instructions per kind). Divergence occurs only at primitive boundaries where adjacent
   quads have different kinds.

3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
   The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
   solid-color selection and outline rendering — all lightweight branches (3–8 instructions per
   path). Divergence at primitive boundaries between different flag combinations has negligible
   cost.

For category 3, the divergence analysis depends on primitive size:

- **Large primitives** (buttons, panels, containers — 50+ pixels on a side): a 200×100 rect
  produces ~20,000 fragments = ~625 warps. At most ~4 boundary warps might straddle a neighbor of a
  different kind. Divergence rate: **0.6%** of warps.

- **Small primitives** (icons, dots — 16×16): 256 fragments = ~8 warps. At most 2 boundary warps
  diverge. Divergence rate: **25%** of warps for that primitive, but the primitive itself covers a
  tiny fraction of the frame's total fragments.

- **Worst realistic case**: a dense grid of alternating shape kinds (e.g., circle-rect-circle-rect
  icons). Even here, the interior warps of each primitive are coherent; only the edges diverge.
  Total frame-level divergence is typically **1–3%** of all warps.

At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the
cost of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:

- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
  flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
  specifically because CPU-side tessellation was the bottleneck, not fragment branching:
  https://github.com/audulus/vger-rs

- **Randy Gaul's 2D renderer**: single pipeline with `shape_type` encoded as a vertex attribute.
  Reports that warp divergence "really hasn't been an issue for any game I've seen so far" because
  "games tend to draw a lot of the same shape type":
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

#### What kind of branching IS expensive

For completeness, here are the cases where shader branching genuinely hurts — none of which apply
to our design:

1. **Per-fragment data-dependent branches with high divergence.** Example:
   `if (texture(noise, uv).r > 0.5)` where the noise texture produces a random pattern. Every warp
   has ~50% divergence, and every warp pays for both paths. This is the scenario the "branches are
   bad" folklore warns about. We have no per-fragment data-dependent branches in the main pipeline.

2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
   divergent warps pay double a large cost. Our SDF kind branches are short (~15–30 instructions
   each), and the gradient/texture/solid-color selection branches are shorter still (3–8
   instructions each). Even fully divergent, the combined penalty is ~30–60 extra instructions —
   comparable to a single texture sample's latency.

3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
   across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
   Volta+, AMD RDNA+, Apple M-series) use scalar+vector execution models where this is not a
   concern.

4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
   split heavy effects into separate pipelines. Within the main pipeline, the four SDF kind branches
   and flag-based color selection cluster at ~22–26 registers (see the register analysis in Current
   state), within the ≤24-register budget that guarantees full occupancy on Valhall and all desktop
   architectures. See Known limitations for V3D / Bifrost.

**References:**

- ARM solidpixel blog on branches in mobile shaders — comprehensive taxonomy of branch execution
  models across GPU generations; confirms uniform and warp-coherent branches are free on modern
  hardware:
  https://solidpixel.github.io/2021/12/09/branches_in_shaders.html
- Peter Stefek's "A Note on Branching Within a Shader" — practical measurements showing that
  warp-coherent branches have zero overhead on Pascal/Volta/Ampere, with a clear explanation of the
  SIMT divergence mechanism:
  https://www.peterstefek.me/shader-branch.html
- NVIDIA Volta architecture whitepaper — documents independent thread scheduling, which allows
  divergent threads to reconverge more efficiently than on older architectures:
  https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Randy Gaul on warp divergence in practice with per-primitive `shape_type` branching:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html

### Main pipeline: SDF + tessellated (unified)

The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms` push constant
(`Draw_Mode.Tessellated = 0`, `Draw_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.

- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
  (SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
  user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of `Primitive`
  structs, drawn instanced. Used for all shapes with closed-form signed distance functions.

Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
|
||
`Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture sample
|
||
and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
|
||
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.

#### Why SDF for shapes

CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:

1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
   = 5 KB. An SDF rounded rectangle is one `Primitive` struct (80 bytes) plus six shared unit-quad
   vertices. That is roughly a 60× reduction per shape.

2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
   animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
   mathematically exact boundaries with perfect anti-aliasing via `smoothstep` in the fragment
   shader.

3. **Feature cost.** Adding soft edges, outlines, stroke effects, or rounded-cap line segments
   requires extensive per-shape tessellation code. With SDF, these are trivial fragment shader
   operations: `abs(d) - thickness` for stroke, `smoothstep(-soft, soft, d)` for soft edges.

**References:**

- Inigo Quilez's 2D SDF primitive catalog (primary source for all SDF functions used):
  https://iquilezles.org/articles/distfunctions2d/
- Valve's 2007 SIGGRAPH paper on SDF for vector textures and glyphs (foundational reference):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf
- Randy Gaul's practical writeup on SDF 2D rendering with shape-type branching, attribute layout,
  warp divergence tradeoffs, and polyline rendering:
  https://randygaul.github.io/graphics/2025/03/04/2D-Rendering-SDF-and-Atlases.html
- Audulus vger-rs — production 2D renderer using a single unified pipeline with an SDF type
  discriminant, the same architecture as this plan. Replaced nanovg, achieving 120 FPS where nanovg
  fell to 30 FPS due to CPU-side tessellation:
  https://github.com/audulus/vger-rs

#### Storage-buffer instancing for SDF primitives

SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceIndex` in the vertex
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.

Each SDF shape is described by a single `Primitive` struct (80 bytes) in the storage buffer. The
vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position from the unit
vertex and the primitive's bounds, and passes shape parameters to the fragment shader via `flat`
interpolated varyings.

Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach),
storage-buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded
rectangle costs 80 bytes instead of 4 vertices × 40+ bytes = 160+ bytes.

The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.

#### Shape kinds and SDF dispatch

The fragment shader dispatches on `Shape_Kind` (low byte of `Primitive.flags`) to evaluate one of
four signed distance functions. The `Shape_Kind` enum and per-kind `*_Params` structs are defined in
`pipeline_2d_base.odin`. CPU-side drawing procs in `shapes.odin` build the appropriate `Primitive`
and set the kind automatically:

| User-facing proc     | Shape_Kind | SDF function       | Notes                                                      |
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| `rectangle`          | `RRect`    | `sdRoundedBox`     | Per-corner radii from `radii` param                        |
| `rectangle_texture`  | `RRect`    | `sdRoundedBox`     | Textured fill; `.Textured` flag set                        |
| `circle`             | `RRect`    | `sdRoundedBox`     | Uniform radii = half-size (circle is a degenerate RRect)   |
| `line`, `line_strip` | `RRect`    | `sdRoundedBox`     | Rotated capsule — stadium shape (radii = half-thickness)   |
| `ellipse`            | `Ellipse`  | `sdEllipseApprox`  | Approximate ellipse SDF (fast, suitable for UI)            |
| `polygon`            | `NGon`     | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle              |
| `ring` (full)        | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping       |
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask           |
| `ring` (pie slice)   | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |

The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture,
rotation, arc geometry). See the `Shape_Flag` enum in `pipeline_2d_base.odin` for the authoritative
flag definitions and bit assignments.

**What stays tessellated:**

- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `tess.pixel` (single-pixel points)
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape`

The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.

### Effects pipeline

The effects pipeline handles blur-based visual effects: drop shadows, inner shadows, outer glow, and
similar. It uses the same storage-buffer instancing pattern as the main pipeline's SDF path, with a
dedicated pipeline state object that has its own compiled fragment shader.

#### Combined shape + effect rendering

When a shape has an effect (e.g., a rounded rectangle with a drop shadow), the shape is drawn
**once**, entirely in the effects pipeline. The effects fragment shader evaluates both the effect
(blur math) and the base shape's SDF, compositing them in a single pass. The shape is not duplicated
across pipelines.

This avoids redundant overdraw. Consider a 200×100 rounded rect with a drop shadow offset by (5, 5)
and blur sigma 10:

- **Separate-primitive approach** (shape in main pipeline + shadow in effects pipeline): the shadow
  quad covers ~230×130 = 29,900 pixels, the shape quad covers 200×100 = 20,000 pixels. The ~18,500
  shadow fragments underneath the shape run the expensive blur shader only to be overwritten by the
  shape. Total fragment invocations: ~49,900.

- **Combined approach** (one primitive in effects pipeline): one quad covers ~230×130 = 29,900
  pixels. The fragment shader evaluates the blur, then evaluates the shape SDF, and composites the
  shape on top. Total fragment invocations: ~29,900. The 20,000 shape-region fragments run the
  blur+shape shader, but the shape SDF evaluation adds only ~15 ops to an ~80-op blur shader.

The combined approach uses **~40% fewer fragment invocations** per effected shape (29,900 vs 49,900)
in the common opaque case. The shape-region fragments pay a small additional cost for shape SDF
evaluation in the effects shader (~15 ops), but this is far cheaper than running 18,500 fragments
through the full blur shader (~80 ops each) and then discarding their output. For a UI with 10
shadowed elements, the combined approach saves roughly 200,000 fragment invocations per frame.

An `Effect_Flag.Draw_Base_Shape` flag controls whether the sharp shape layer composites on top
(default true for drop shadow, always true for inner shadow). Standalone effects (e.g., a glow with
no shape on top) clear this flag.

Shapes without effects are submitted to the main pipeline as normal. Only shapes that have effects
are routed to the effects pipeline.

#### Drop shadow implementation

Drop shadows use the analytical blurred-rounded-rectangle technique. Raph Levien's 2020 blog post
describes an erf-based approximation that computes a Gaussian-blurred rounded rectangle in closed
form along one axis, with a 4-sample numerical integration along the other. Total fragment cost is
~80 FLOPs, one sqrt, and no texture samples. This is the same technique used by Zed GPUI (via Evan
Wallace's variant) and vger-rs.

**References:**

- Raph Levien's blurred rounded rectangles post (erf approximation, squircle contour refinement):
  https://raphlinus.github.io/graphics/2020/04/21/blurred-rounded-rects.html
- Evan Wallace's original WebGL implementation (used by Figma):
  https://madebyevan.com/shaders/fast-rounded-rectangle-shadows/
- Vello's implementation of the blurred rounded rectangle as a gradient type:
  https://github.com/linebender/vello/pull/665

### Backdrop pipeline

The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
refraction, mirror surfaces. It is separated from the effects pipeline for a structural reason, not
register pressure.

**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
must be copied to a separate texture via `CopyGPUTextureToTexture`. This is a command-buffer-level
operation that cannot happen mid-render-pass. The copy naturally creates a pipeline boundary that no
amount of shader optimization can eliminate — it is a fundamental requirement of sampling a surface
while also writing to it.

**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used by
iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
require, keeping each sub-pass well under the 32-register cliff.

**Bracketed execution.** All backdrop draws in a frame share a single bracketed region of the command
buffer: end the current render pass, copy the render target, execute all backdrop sub-passes, then
resume normal drawing. The entry/exit cost (texture copy + render-pass break) is paid once per frame
regardless of how many backdrop effects are visible. When no backdrop effects are present, the
bracket is never entered and the texture copy never happens — zero cost.

**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
bracket would have negligible impact on frame time.

### Vertex layout

The vertex struct is unchanged from the current 20-byte layout:

```
Vertex :: struct {
    position: [2]f32, // 0:  screen-space position
    uv:       [2]f32, // 8:  atlas UV (text) or unused (shapes)
    color:    Color,  // 16: u8x4, GPU-normalized to float
}
```

This layout is shared between the tessellated path and the SDF unit-quad vertices. For tessellated
draws, `position` carries actual world-space geometry. For SDF draws, `position` carries unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.

The `Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:

```
Primitive :: struct {
    bounds:      [4]f32,        // 0:  min_x, min_y, max_x, max_y
    color:       Color,         // 16: u8x4, unpacked in shader via unpackUnorm4x8
    flags:       u32,           // 20: low byte = Shape_Kind, bits 8+ = Shape_Flags
    rotation_sc: u32,           // 24: packed f16 pair (sin, cos). Requires .Rotated flag.
    _pad:        f32,           // 28: reserved for future use
    params:      Shape_Params,  // 32: per-kind params union (half_feather, radii, etc.) (32 bytes)
    uv:          Uv_Or_Effects, // 64: texture UV rect or gradient/outline parameters (16 bytes)
}
// Total: 80 bytes (std430 aligned)
```

`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `pipeline_2d_base.odin`. Each SDF kind
writes its own params variant; the fragment shader reads the appropriate fields based on
`Shape_Kind`. `Uv_Or_Effects` is a `#raw_union` that aliases `[4]f32` (texture UV rect: u_min,
v_min, u_max, v_max) with a `Gradient_Outline` struct containing `gradient_color: Color`,
`outline_color: Color`, `gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32`
(packed f16 outline width). The `flags` field encodes the `Shape_Kind` in the low byte and
`Shape_Flags` in bits 8+ via `pack_kind_flags`.

### Draw submission order

Within each scissor region, draws are issued in submission order to preserve the painter's algorithm:

1. Bind **effects pipeline** → draw all queued effects primitives for this scissor (instanced, one
   draw call). Each effects primitive includes its base shape and composites internally.
2. Bind **main pipeline, tessellated mode** → draw all queued tessellated vertices (non-indexed for
   shapes, indexed for text). Pipeline state unchanged from today.
3. Bind **main pipeline, SDF mode** → draw all queued SDF primitives (instanced, one draw call).
4. If backdrop effects are present: copy render target, bind **backdrop pipeline** → draw
   backdrop primitives.

The exact ordering within a scissor may be refined based on actual Z-ordering requirements. The key
invariant is that each primitive is drawn exactly once, in the pipeline that owns it.

### Text rendering

Text rendering currently uses SDL_ttf's GPU text engine, which rasterizes glyphs per `(font, size)`
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. This path is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `mode = 0`, sampling the SDL_ttf atlas texture.

A future phase may evaluate MSDF (multi-channel signed distance field) text rendering, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:

- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
  already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency.

This is explicitly deferred. The SDF shape migration is independent of, and does not block, text
changes.

**References:**

- Viktor Chlumský's MSDF master's thesis and msdfgen tool:
  https://github.com/Chlumsky/msdfgen
- MSDF atlas generator for font atlas packing:
  https://github.com/Chlumsky/msdf-atlas-gen
- Valve's original SDF text rendering paper (SIGGRAPH 2007):
  https://steamcdn-a.akamaihd.net/apps/valve/2007/SIGGRAPH2007_AlphaTestedMagnification.pdf

### Textures

Textures plug into the existing main pipeline — no additional GPU pipeline, no shader rewrite. The
work is a resource layer (registration, upload, sampling, lifecycle) plus two textured-draw procs
that route into the existing tessellated and SDF paths respectively.

#### Why draw owns registered textures

A texture's GPU resource (the `^sdl.GPUTexture`, transfer buffer, shader resource view) is created
and destroyed by draw. The user provides raw bytes and a descriptor at registration time; draw
uploads synchronously and returns an opaque `Texture_Id` handle. The user can free their CPU-side
bytes immediately after `register_texture` returns.

This follows the model used by the RAD Debugger's render layer (`src/render/render_core.h` in
EpicGamesExt/raddebugger, MIT license), where `r_tex2d_alloc` takes `(kind, size, format, data)`
and returns an opaque handle that the renderer owns and releases. The single-owner model eliminates
an entire class of lifecycle bugs (double-free, use-after-free across subsystems, unclear cleanup
responsibility) that dual-ownership designs introduce.

If advanced interop is ever needed (e.g., a future 3D pipeline or compute shader sharing the same
GPU texture), the clean extension is a borrowed-reference accessor (`get_gpu_texture(id)`) that
returns the underlying handle without transferring ownership. This is purely additive and does not
require changing the registration API.

#### Why `Texture_Kind` exists

`Texture_Kind` (Static / Dynamic / Stream) is a driver hint for update frequency, adopted from the
RAD Debugger's `R_ResourceKind`. It maps directly to SDL3 GPU usage patterns:

- **Static**: uploaded once, never changes. Covers QR codes, decoded PNGs, icons — the 90% case.
- **Dynamic**: updatable via `update_texture_region`. Covers font atlas growth, procedural updates.
- **Stream**: frequent full re-uploads. Covers video playback, per-frame procedural generation.

This costs one byte in the descriptor and lets the backend pick optimal memory placement without a
future API change.

#### Why samplers are per-draw, not per-texture

A sampler describes how to filter and address a texture during sampling — nearest vs bilinear, clamp
vs repeat. This is a property of the _draw_, not the texture. The same QR code texture should be
sampled with `Nearest_Clamp` when displayed at native resolution but could reasonably be sampled
with `Linear_Clamp` in a zoomed-out thumbnail. The same icon atlas might be sampled with
`Nearest_Clamp` for pixel art or `Linear_Clamp` for smooth scaling.

The RAD Debugger follows this pattern: `R_BatchGroup2DParams` carries `tex_sample_kind` alongside
the texture handle, chosen per batch group at draw time. We do the same — `Sampler_Preset` is a
parameter on the draw procs, not a field on `Texture_Desc`.

Internally, draw keeps a small pool of pre-created `^sdl.GPUSampler` objects (one per preset,
lazily initialized). Sub-batch coalescing keys on `(kind, texture_id, sampler_preset)` — draws
with the same texture but different samplers produce separate draw calls, which is correct.

#### Textured draw procs

Textured rectangles route through the existing SDF path via `rectangle_texture`, which mirrors
`rectangle` exactly — same parameters for radii, origin, rotation, feather — with the `color`
parameter replaced by a `Texture_Id`, an optional `tint`, a `uv_rect`, and a `Sampler_Preset`.

An earlier iteration of this design considered a separate tessellated proc for "simple" fullscreen
quads, on the theory that the tessellated path's lower register count would improve occupancy at
large fragment counts. In practice, both paths are well within the ≤24-register main pipeline budget
— both run at full occupancy on every target architecture (Valhall and above). The remaining ALU
difference (~15 extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise.
Meanwhile, splitting into a separate pipeline would add ~1–5μs per pipeline bind on the CPU side per
scissor, matching or exceeding the GPU-side savings. Within the main pipeline, a unified path
remains strictly better.

SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `rectangle_texture`,
`circle`, `ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients and outlines are optional
parameters on each proc rather than separate overloads. Future per-shape texture variants
(`circle_texture`, `ellipse_texture`) are additive.

#### What SDF anti-aliasing does and does not do for textured draws

The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
depends on how closely the display size matches the SDL_ttf atlas's rasterized size.

#### Fit modes are a computation layer, not a renderer concept

Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV
sub-region computations on top of the `uv_rect` parameter that both textured-draw procs accept. The
renderer has no knowledge of fit modes — it samples whatever UV region it is given.

A `fit_params` helper computes the appropriate `uv_rect`, sampler preset, and (for letterbox/fit
mode) shrunken inner rect from a `Fit_Mode` enum, the target rect, and the texture's pixel size.
Users who need custom UV control (sprite atlas sub-regions, UV animation, nine-patch slicing) skip
the helper and compute `uv_rect` directly. This keeps the renderer primitive minimal while making
the common cases convenient.

#### Deferred release

`unregister_texture` does not immediately release the GPU texture. It queues the slot for release at
the end of the current frame, after `SubmitGPUCommandBuffer` has handed work to the GPU. This
prevents a race condition where a texture is freed while the GPU is still sampling from it in an
already-submitted command buffer. The same deferred-release pattern is applied to `clear_text_cache`
and `clear_text_cache_entry`, fixing a pre-existing latent bug where destroying a cached
`^sdl_ttf.Text` mid-frame could free an atlas texture still referenced by in-flight draw batches.

This pattern is standard in production renderers — the RAD Debugger's `r_tex2d_release` queues
textures onto a free list that is processed in `r_end_frame`, not at the call site.

#### Clay integration

Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: `fit_params` computes UVs from the fit mode, then `rectangle_texture`
is called with the appropriate radii (zero for sharp corners, per-corner values from Clay's
`cornerRadius` otherwise).

#### Deferred features

The following are plumbed in the descriptor but not implemented in phase 1:

- **Mipmaps**: `Texture_Desc.mip_levels` field exists; generation via SDL3 deferred.
- **Compressed formats**: `Texture_Desc.format` accepts BC/ASTC; upload path deferred.
- **Render-to-texture**: `Texture_Desc.usage` accepts `.COLOR_TARGET`; render-pass refactor deferred.
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
- **Per-shape texture variants**: `circle_texture`, `ellipse_texture`, `polygon_texture` — potential
  future additions, following the existing naming convention.

**References:**

- RAD Debugger render layer (ownership model, deferred release, sampler-at-draw-time):
  https://github.com/EpicGamesExt/raddebugger — `src/render/render_core.h`,
  `src/render/d3d11/render_d3d11.c`
- Casey Muratori, Handmade Hero day 472 — texture handling as a renderer-owned resource concern,
  atlases as a separate layer above the renderer.

## 3D rendering

3D pipeline architecture is under consideration and will be documented separately. The current
expectation is that 3D rendering will use dedicated pipelines (separate from the 2D pipelines)
sharing GPU resources (textures, samplers, command buffer lifecycle) with the 2D renderer.

## Multi-window support

The renderer currently assumes a single window via the global `GLOB` state. Multi-window support is
deferred but anticipated. When revisited, the RAD Debugger's bucket + pass-list model
(`src/draw/draw.h`, `src/draw/draw.c` in EpicGamesExt/raddebugger) is worth studying as a reference.

RAD separates draw submission from rendering via **buckets**. A `DR_Bucket` is an explicit handle
that accumulates an ordered list of render passes (`R_PassList`). The user creates a bucket, pushes
it onto a thread-local stack, issues draw calls (which target the top-of-stack bucket), then submits
the bucket to a specific window. Multiple buckets can exist simultaneously — one per window, or one
per UI panel that gets composited into a parent bucket via `dr_sub_bucket`. Implicit draw parameters
(clip rect, 2D transform, sampler mode, transparency) are managed via push/pop stacks scoped to each
bucket, so different windows can have independent clip and transform state without interference.

The key properties this gives RAD:

- **Per-window isolation.** Each window builds its own bucket with its own pass list and state
  stacks. No global contention.
- **Thread-parallel building.** Each thread has its own draw context and arena. Multiple threads can
  build buckets concurrently, then submit them to the render backend sequentially.
- **Compositing.** A pre-built bucket (e.g., a tooltip or overlay) can be injected into another
  bucket with a transform applied, without rebuilding its draw calls.

For our library, the likely adaptation would be replacing the single `GLOB` with a per-window draw
context that users create and pass to `begin`/`end`, while keeping the explicit-parameter draw call
style rather than adopting RAD's implicit state stacks. Texture and sampler resources would remain
global (shared across windows), with only the per-frame staging buffers and layer/scissor state
becoming per-context.

## Building shaders

GLSL shader sources live in `shaders/source/`. Compiled outputs (SPIR-V and Metal Shading Language)
are generated into `shaders/generated/` via the meta tool:

```
odin run meta -- gen-shaders
```

Requires `glslangValidator` and `spirv-cross` on PATH.

### Shader format selection

The library embeds shader bytecode per compile target — MSL + `main0` entry point on Darwin (via
`spirv-cross --msl`, which renames `main` because it is reserved in Metal), SPIR-V + `main` entry
point elsewhere. Three compile-time constants in `draw.odin` expose the build's shader configuration:

| Constant                      | Type                      | Darwin    | Other      |
| ----------------------------- | ------------------------- | --------- | ---------- |
| `PLATFORM_SHADER_FORMAT_FLAG` | `sdl.GPUShaderFormatFlag` | `.MSL`    | `.SPIRV`   |
| `PLATFORM_SHADER_FORMAT`      | `sdl.GPUShaderFormat`     | `{.MSL}`  | `{.SPIRV}` |
| `SHADER_ENTRY`                | `cstring`                 | `"main0"` | `"main"`   |

Pass `PLATFORM_SHADER_FORMAT` to `sdl.CreateGPUDevice` so SDL selects a backend compatible with the
embedded bytecode:

```
gpu := sdl.CreateGPUDevice(draw.PLATFORM_SHADER_FORMAT, true, nil)
```

At init time the library calls `sdl.GetGPUShaderFormats(device)` to verify that the active backend
accepts `PLATFORM_SHADER_FORMAT_FLAG`. If it does not, `draw.init` returns `false` with a
descriptive log message showing both the embedded and active format sets.
|