Backdrop Path + Cybersteel (#23)

Co-authored-by: Zachary Levy <zachary@sunforge.is>
Reviewed-on: #23
This commit was merged in pull request #23.
2026-05-01 05:43:10 +00:00
parent e36229a3ef
commit 5317b8f142
66 changed files with 5806 additions and 2427 deletions
@@ -5,54 +5,60 @@ Clay UI integration.
## Current state
The renderer uses a single unified `Core_2D` (`TRIANGLELIST` pipeline) with two submission
modes dispatched by a push constant:
- **Mode 0 (Tessellated):** Vertex buffer contains real geometry. Used for text (indexed draws into
SDL_ttf atlas textures), single-pixel points (`tess.pixel`), arbitrary user geometry
(`tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines`, `tess.triangle_fan`,
`tess.triangle_strip`), and any raw vertex geometry submitted via `prepare_shape`. The fragment
shader premultiplies the texture sample (`t.rgb *= t.a`) and computes `out = color * t`.
- **Mode 1 (SDF):** A static 6-vertex unit-quad buffer is drawn instanced, with per-primitive
`Core_2D_Primitive` structs (96 bytes each) uploaded each frame to a GPU storage buffer. The vertex
shader reads `primitives[gl_InstanceIndex]`, computes world-space position from unit quad corners +
primitive bounds. The fragment shader dispatches on `Shape_Kind` (encoded in the low byte of
`Core_2D_Primitive.flags`) to evaluate one of four signed distance functions:
- **RRect** (kind 1) — `sdRoundedBox` with per-corner radii. Covers rectangles (sharp or rounded),
circles (uniform radii = half-size), and line segments / capsules (rotated RRect with uniform
radii = half-thickness). Covers filled, outlined, textured, and gradient-filled variants.
- **NGon** (kind 2) — `sdRegularPolygon` for regular N-sided polygons.
- **Ellipse** (kind 3) — `sdEllipseApprox`, an approximate ellipse SDF suitable for UI rendering.
- **Ring_Arc** (kind 4) — annular ring with optional angular clipping via pre-computed edge
normals. Covers full rings, partial arcs, and pie slices (`inner_radius = 0`).
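The 96-byte figure can be sanity-checked with a C transcription of the struct. Only the total size, the dedicated 16-byte `uv_rect` and `effects` slots, and the kind-in-low-byte `flags` encoding come from this document; the exact field order and the `bounds`/`color` packing shown here are assumptions:

```c
#include <stdint.h>

/* Hypothetical layout of Core_2D_Primitive. The total size (96 bytes), the
   16-byte uv_rect and effects slots, and the flags encoding match the text;
   the remaining field split is illustrative only. */
typedef struct {
    float    bounds[4];        /* center xy + half-size xy (assumed split) */
    float    corner_radii[4];  /* per-corner radii for the RRect kind */
    float    color[4];         /* fill color (assumed slot) */
    float    uv_rect[4];       /* texture UV rect: its own 16-byte slot */
    float    effects[4];       /* gradient/outline params: its own 16-byte slot */
    uint32_t flags;            /* low byte = Shape_Kind, upper bits = Shape_Flags */
    uint32_t _pad[3];          /* pad to a 16-byte multiple for std430 arrays */
} Core_2D_Primitive;

_Static_assert(sizeof(Core_2D_Primitive) == 96,
               "CPU struct must match the 96-byte GPU-side stride");
```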
All SDF shapes support fill, outline, solid color, 2-color linear gradients, 2-color radial
gradients, and texture fills via `Shape_Flags` (see `core_2d.odin`). The texture UV rect
(`uv_rect: [4]f32`) and the gradient/outline parameters (`effects: Gradient_Outline`) live in their
own 16-byte slots in `Core_2D_Primitive`, so a primitive can carry texture and outline simultaneously.
Gradient and texture remain mutually exclusive at the fill-source level (a Brush variant chooses one
or the other) since they share the worst-case fragment-shader register path.
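A minimal sketch of the flags encoding described above. The bit positions of the upper flag bits are hypothetical (the real `Shape_Flags` values live in `core_2d.odin`); only the kind-in-low-byte split is from the text. Note that texture and outline can coexist on one primitive, per the paragraph above:

```c
#include <stdint.h>

enum Shape_Kind {
    KIND_SOLID = 0, KIND_RRECT = 1, KIND_NGON = 2,
    KIND_ELLIPSE = 3, KIND_RING_ARC = 4,
};

/* Hypothetical Shape_Flags bit positions -- illustration only. */
enum {
    FLAG_GRADIENT_LINEAR = 1u << 8,
    FLAG_GRADIENT_RADIAL = 1u << 9,
    FLAG_TEXTURED        = 1u << 10,
    FLAG_OUTLINE         = 1u << 11,
};

/* Pack the kind into the low byte, Shape_Flags into the upper bits. */
static uint32_t pack_flags(enum Shape_Kind kind, uint32_t shape_flags) {
    return (shape_flags & ~0xFFu) | ((uint32_t)kind & 0xFFu);
}

static enum Shape_Kind unpack_kind(uint32_t flags) {
    return (enum Shape_Kind)(flags & 0xFFu);
}
```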
All SDF shapes produce mathematically exact curves with analytical anti-aliasing via `smoothstep`:
no tessellation, no piecewise-linear approximation. A rounded rectangle is 1 primitive (96 bytes)
instead of ~250 vertices (~5000 bytes).
The main pipeline's register budget is **≤24 registers** (see "Main/effects split: register pressure"
in the pipeline plan below for the full cliff/margin analysis and SBC architecture context).
The fragment shader's estimated peak footprint is ~22–26 fp32 VGPRs (~16–22 fp16 VGPRs on architectures
with native mediump) via manual live-range analysis. The dominant peak is the Ring_Arc kind path
(wedge normals + inner/outer radii + dot-product temporaries live simultaneously with carried state
like `f_color`, `f_uv_rect`/`f_effects`, and `half_size`). RRect is 1–2 regs lower (`corner_radii` vec4
replaces the separate inner/outer + normal pairs). NGon and Ellipse are lighter still. Real compilers
apply live-range coalescing, mediump-to-fp16 promotion, and rematerialization that typically shave
2–4 regs from hand-counted estimates — the conservative 26-reg upper bound is expected to compile
down to within the 24-register budget, but this must be verified with `malioc` (see "Verifying
register counts" below). On V3D and Bifrost architectures (16-register cliff), the compiler
statically allocates registers for the worst-case path (Ring_Arc) regardless of which kind any given
fragment actually evaluates, so all fragments pay the occupancy cost of the heaviest branch. This is
a documented limitation, not a design constraint (see "Known limitations: V3D and Bifrost" below).
MSAA is intentionally not supported. SDF text and shapes compute fragment coverage analytically
via `smoothstep`, so they don't benefit from multisampling. Tessellated user geometry submitted via
`prepare_shape` is rendered without anti-aliasing — if AA is required for tessellated content, the
caller must render it to their own offscreen target and submit the result as a texture. This
decision matches RAD Debugger's architecture and aligns with the SBC target (Mali Valhall, where
MSAA's per-tile bandwidth multiplier is expensive).
## 2D rendering pipeline plan
@@ -66,22 +72,23 @@ primitives and effects can be added to the library without architectural changes
The 2D renderer uses three GPU pipelines, split by **register pressure** (main vs effects) and
**render-pass structure** (everything vs backdrop):
1. **Main pipeline** — shapes (SDF and tessellated), text, and textured rectangles. Register budget:
**≤24 registers** (full occupancy on Valhall and all desktop GPUs). Handles 90%+ of all fragments
in a typical frame.
2. **Effects pipeline** — drop shadows, inner shadows, outer glow, and similar ALU-bound blur
effects. Register budget: **≤56 registers** (targets Valhall's second cliff at 64; reduced
occupancy at the first cliff is accepted by design). Each effects primitive includes the base
shape's SDF so that it can draw both the effect and the shape in a single fragment pass, avoiding
redundant overdraw. Separated from the main pipeline to protect main-pipeline occupancy on
low-end hardware (see register analysis below).
3. **Backdrop pipeline** — frosted glass, refraction, and any effect that samples the current render
target as input. Implemented as a multi-pass sequence (downsample, separable blur, composite),
where each individual sub-pass has a register budget of **≤24 registers** (full occupancy on
Valhall). Separated from the other pipelines because it structurally requires ending the current
render pass and copying the render target before any backdrop-sampling fragment can execute — a
command-buffer-level boundary that cannot be avoided regardless of shader complexity.
A typical UI frame with no effects uses 1 pipeline bind and 0 switches. A frame with drop shadows
uses 2 pipelines and 1 switch. A frame with shadows and frosted glass uses all 3 pipelines and 2
@@ -97,56 +104,113 @@ code) or many per-primitive-type pipelines (no branching overhead, lean per-shad
A GPU shader core has a fixed register pool shared among all concurrent threads. The compiler
allocates registers pessimistically based on the worst-case path through the shader. If the shader
contains both a 24-register RRect SDF and a 56-register drop-shadow blur, _every_ fragment — even
trivial RRects — is allocated 56 registers. This directly reduces **occupancy** (the number of
warps/wavefronts that can run simultaneously), which reduces the GPU's ability to hide memory
latency.
Each GPU architecture has discrete **occupancy cliffs**: register counts above which the number of
concurrent threads drops in a step. Below the cliff, adding registers has zero occupancy cost. One
register over, throughput drops sharply.
**Target architecture: ARM Mali Valhall (32-register first cliff).** The binding constraint for our
register budgets comes from the SBC (single-board computer) market, where Mali Valhall is the
dominant current GPU architecture:
- **RK3588-class boards** (Orange Pi 5, Radxa Rock 5, Khadas Edge 2, NanoPi R6, Banana Pi M7) ship
**Mali-G610** (Valhall). This is the dominant non-Pi SBC platform. First occupancy cliff at **32
registers**, second cliff at **64 registers**.
- **ARM Mali Valhall** (G57, G77, G78, G610, G710, G715; 2019+) and **5th-gen / Mali-G1** (2024+):
same cliff structure — first at 32, second at 64.
- **ARM Mali Bifrost** (G31, G51, G52, G71, G72, G76; ~2016–2018): first cliff at **16 registers**.
Legacy; found on older budget boards (Allwinner H6/H618, Amlogic S922X). See Known limitations
below.
- **Broadcom V3D 4.x / 7.x** (Raspberry Pi 4 / Pi 5): first cliff at **16 registers**. Outlier in
the current SBC market. See Known limitations below.
- **Apple M3+**: Dynamic Caching (register file virtualization) eliminates the static cliff entirely.
Register allocation happens at runtime based on actual usage.
- **Qualcomm Adreno**: dynamic register allocation with soft thresholds; no hard cliff.
- **NVIDIA desktop** (Ampere/Ada): cliff at ~43 registers. Not a constraint for any of our pipelines.
**Register budgets and margin.** We target Valhall's 32-register first cliff for the main and
backdrop pipelines, and Valhall's 64-register second cliff for the effects pipeline, each with **8
registers of margin**:
| Pipeline | Cliff targeted | Margin | Register budget | Rationale |
| ------------------- | ---------------------- | ------ | ----------------- | --------------------------------------------------------------------------------------------- |
| Main pipeline | 32 (Valhall 1st cliff) | 8 | **≤24 regs** | Handles 90%+ of frame fragments; must run at full occupancy |
| Backdrop sub-passes | 32 (Valhall 1st cliff) | 8 | **≤24 regs** each | Multi-pass structure keeps each pass small; no reason to give up occupancy |
| Effects pipeline | 64 (Valhall 2nd cliff) | 8 | **≤56 regs** | Reduced occupancy at 1st cliff accepted by design — the entire point of splitting effects out |
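The budget table can be read as a step function. The sketch below is a simplified model of the Valhall cliff numbers stated above (not vendor data), useful for eyeballing what a register-count change costs:

```c
/* Step-function model of Mali Valhall occupancy, per the cliffs in the text:
   first cliff at 32 registers, second at 64. The value past the second cliff
   is a modeling assumption, not a vendor figure. */
static int valhall_occupancy_pct(int regs) {
    if (regs <= 32) return 100;  /* below the first cliff: full thread count */
    if (regs <= 64) return 50;   /* between cliffs: thread count halves */
    return 25;                   /* beyond the second cliff (model only) */
}
```

Under this model the ≤24-register main budget and a 32-register shader run identically, while one register over the cliff (33) halves throughput — the asymmetry the margin discussion below relies on.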
**Why 8 registers of margin.** Targeting the cliff exactly is fragile. Three forces push register
counts upward over a shader's lifetime:
1. **Compiler version changes.** Mali driver releases (r35p0 → r55p0 etc.) ship new register
allocators. Shaders typically drift ±2–3 registers between versions on unchanged source.
2. **Feature additions.** Each new effect, flag, or uniform adds 1–4 live registers. A new gradient
mode or outline option lands in this range.
3. **Precision regressions.** A `mediump` demoted to `highp` (by bug fix, compiler heuristic change,
or a contributor not knowing) costs 2 registers per affected `vec4`.
Realistic creep over a couple of years is 4–8 registers. The cost of conservatism is zero — a shader
at 24 regs runs identically to one at 32 on every Valhall device. The cost of crossing the cliff is
a 2× throughput drop with no warning. Asymmetric costs justify a generous margin.
**Why the main/effects split exists.** If the main pipeline shader contained both the 24-register
SDF path and the ~50-register drop-shadow blur, every fragment — even trivial RRects — would be
allocated ~50 registers. On Valhall this crosses the 32-register first cliff, halving occupancy for
90%+ of the frame's fragments. Separating effects into their own pipeline means the main pipeline
stays at ≤24 registers (full Valhall occupancy), and only the small fraction of fragments that
actually render effects (~5–10% in a typical UI) run at reduced occupancy.
For the effects pipeline's drop-shadow shader — analytical erf-approximation blur (~80 FLOPs, no
texture samples) — 50% occupancy on Valhall roughly halves throughput. At 4K with 1.5× overdraw (~12.4M
fragments), a single unified shader containing the shadow branch would cost ~4ms instead of ~2ms on
Valhall. This is a per-frame multiplier even when the heavy branch is never taken, because the
compiler allocates registers for the worst-case path.
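The fragment count behind the ~4ms figure is checkable arithmetic (all inputs from the text):

```c
/* A 4K frame with 1.5x overdraw, per the text: ~12.4M fragments. */
static long overdrawn_fragments(void) {
    return 3840L * 2160L * 3L / 2L;  /* = 12,441,600 */
}

/* Halved occupancy roughly doubles fragment-bound cost: ~2ms -> ~4ms. */
static double unified_shader_cost_ms(double split_cost_ms) {
    return split_cost_ms * 2.0;
}
```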
The effects pipeline's ≤56-register budget keeps it under Valhall's second cliff at 64, yielding
50–67% occupancy on effected shapes. This is acceptable for the small fraction of frame fragments
that effects cover.
**Note on Apple M3+ GPUs:** Apple's M3 Dynamic Caching allocates registers at runtime based on
actual usage rather than worst-case. This eliminates the static register-pressure argument on M3 and
later, but the split remains useful for isolating blur ALU complexity and keeping the backdrop
texture-copy out of the main render pass.
**Note on NVIDIA desktop GPUs:** On consumer Ampere/Ada (cliff at ~43 regs), even the effects
pipeline's ≤56-register budget only reduces occupancy to ~89% — well within noise. On Volta/A100
(cliff at ~32 regs), the effects pipeline drops to ~67%. In both cases the main pipeline runs at
100% occupancy. Desktop GPUs are not the binding constraint; Valhall is.
#### Known limitations: V3D and Bifrost (16-register cliff)
Broadcom V3D 4.x / 7.x (Raspberry Pi 4 / Pi 5) and ARM Mali Bifrost (G31, G51, G52, G71, G72, G76)
have a first occupancy cliff at **16 registers**. All three of our pipelines exceed this cliff — even
the main pipeline's ≤24-register budget is above 16. On these architectures, every shader runs at
reduced occupancy regardless of which shape kind or effect is active.
Restoring full occupancy on V3D / Bifrost would require a fundamentally different shader
architecture: per-shape-kind pipeline splitting (one pipeline per SDF kind, each with a minimal
register footprint under 16). This conflicts with the unified-pipeline design that enables single
draw calls per scissor, submission-order Z preservation, and low PSO compilation cost. It would
effectively be the GPUI-style approach whose tradeoffs are analyzed in "Why not per-primitive-type
pipelines" below.
We treat this as a documented limitation, not a design constraint. The 16-register cliff is legacy
(Bifrost) or a single-vendor outlier (V3D). The dominant current SBC platform (RK3588 / Mali-G610)
and all mainstream mobile and desktop GPUs have cliffs at 32 or higher. The long-term direction in
GPU architecture is toward eliminating static cliffs entirely (Apple Dynamic Caching, Adreno dynamic
allocation).
#### Verifying register counts
The register estimates in this document are hand-counted via manual live-range analysis (see Current
state). Shader changes that affect the main or effects pipeline should be verified with `malioc`
(ARM Mali Offline Compiler) against current Valhall driver versions before merging. `malioc` reports
exact register allocation, spilling, and occupancy for each Mali generation. On desktop, Radeon GPU
Analyzer (RGA) and NVIDIA Nsight provide equivalent data. Replacing the hand-counted estimates with
measured `malioc` numbers is a follow-up task.
#### Backdrop split: render-pass structure
@@ -156,10 +220,11 @@ render target must be copied to a separate texture via `CopyGPUTextureToTexture`
level operation that requires ending the current render pass. This boundary exists regardless of
shader complexity and cannot be optimized away.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are
register-light (~1540 regs each), so merging them into the effects pipeline would cause no occupancy
problem. But the render-pass boundary makes merging structurally impossible — effects draws happen
inside the main render pass, backdrop draws happen inside their own bracketed pass sequence.
The backdrop pipeline's individual shader passes (downsample, separable blur, composite) are budgeted
at ≤24 registers each (same as the main pipeline), so merging them into the effects pipeline would
cause no occupancy problem. But the render-pass boundary makes merging structurally impossible —
effects draws happen inside the main render pass, backdrop draws happen inside their own bracketed
pass sequence.
#### Why not per-primitive-type pipelines (GPUI's approach)
@@ -188,9 +253,9 @@ API where each layer draws shadows before quads before glyphs. Our design avoids
submission order is draw order, no layer juggling required.
**PSO compilation costs multiply.** Each pipeline takes 1–50ms to compile on Metal/Vulkan/D3D12 at
first use. 7 pipelines is ~175ms cold startup; 3 pipelines is ~75ms. Adding state axes (blend
modes, color formats) multiplies combinatorially — a 2.3× larger variant matrix per additional
axis with 7 pipelines vs 3.
**Branching cost comparison: unified vs per-kind in the effects pipeline.** The effects pipeline is
the strongest candidate for per-kind splitting because effect branches are heavier than shape
@@ -271,18 +336,23 @@ There are three categories of branch condition in a fragment shader, ranked by c
#### Which category our branches fall into
Our design has three branch points:
1. **`mode` (push constant): tessellated vs. SDF.** This is category 2 — uniform per draw call.
Every thread in every warp of a draw call sees the same `mode` value. **Zero divergence, zero
cost.**
2. **`kind` (flat varying from storage buffer): SDF shape kind dispatch.** This is category 3.
The low byte of `Primitive.flags` encodes `Shape_Kind` (RRect, NGon, Ellipse, Ring_Arc), passed
to the fragment shader as a `flat` varying. All fragments of one primitive's quad receive the same
kind value. The fragment shader's `if/else if` chain selects the appropriate SDF function (~15–30
instructions per kind). Divergence occurs only at primitive boundaries where adjacent quads have
different kinds.
3. **`flags` (flat varying from storage buffer): gradient/texture/outline mode.** Also category 3.
The upper bits of `Primitive.flags` encode `Shape_Flags`, controlling gradient vs. texture vs.
    solid color selection and outline rendering — all lightweight branches (3–8 instructions per
path). Divergence at primitive boundaries between different flag combinations has negligible cost.
For category 3, the divergence analysis depends on primitive size:
@@ -299,11 +369,12 @@ For category 3, the divergence analysis depends on primitive size:
frame-level divergence is typically **1–3%** of all warps.
At 1–3% divergence, the throughput impact is negligible. At 4K with 12.4M total fragments
(~387,000 warps), divergent boundary warps number in the low thousands. The longest SDF kind branch
is Ring_Arc (~30 instructions); when a divergent warp straddles two different kinds, it pays the cost
of both (~45–60 instructions total). Each divergent warp's extra cost is modest — at ~12G
instructions/sec on a mid-range GPU, even 3,000 divergent warps × 60 extra instructions totals
~15μs, under 0.2% of an 8.3ms (120 FPS) frame budget. This is confirmed by production renderers
that use exactly this pattern:
- **vger / vger-rs** (Audulus): single pipeline, 11 primitive kinds dispatched by a `switch` on a
flat varying `prim_type`. Ships at 120 FPS on iPads. The author (Taylor Holliday) replaced nanovg
@@ -327,10 +398,10 @@ our design:
> have no per-fragment data-dependent branches in the main pipeline.
2. **Branches where both paths are very long.** If both sides of a branch are 500+ instructions,
divergent warps pay double: a large cost. Our SDF kind branches are short (~15–30 instructions
each), and the gradient/texture/solid color selection branches are shorter still (3–8 instructions
each). Even fully divergent, the combined penalty is ~30–60 extra instructions — comparable to a
single texture sample's latency.
3. **Branches that prevent compiler optimizations.** Some compilers cannot schedule instructions
across branch boundaries, reducing VLIW utilization on older architectures. Modern GPUs (NVIDIA
@@ -338,9 +409,10 @@ our design:
concern.
4. **Register pressure from the union of all branches.** This is the real cost, and it is why we
split heavy effects into separate pipelines. Within the main pipeline, the four
   SDF kind branches and flag-based color selection cluster at ~22–26 registers (see register
analysis in Current state), within the ≤24-register budget that guarantees full occupancy on
Valhall and all desktop architectures. See Known limitations for V3D / Bifrost.
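The divergence cost cited in point 2 above can be verified with back-of-envelope arithmetic (all figures from the text; the 8.3ms budget is a 120 FPS frame):

```c
/* 3,000 divergent warps, each paying ~60 extra instructions, on a GPU
   executing ~12G instructions/sec -- the worst case from the text. */
static double divergence_overhead_us(void) {
    double extra_instr = 3000.0 * 60.0;  /* 180,000 extra instructions */
    return extra_instr / 12e9 * 1e6;     /* -> ~15 microseconds */
}

/* Fraction of an 8.3ms (120 FPS) frame budget. */
static double frame_fraction(void) {
    return divergence_overhead_us() / 8300.0;
}
```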
**References:**
@@ -361,27 +433,29 @@ our design:
### Main pipeline: SDF + tessellated (unified)
The main pipeline serves two submission modes through a single `TRIANGLELIST` pipeline and a single
vertex input layout, distinguished by a `mode` field in the `Vertex_Uniforms_2D` push constant
(`Core_2D_Mode.Tessellated = 0`, `Core_2D_Mode.SDF = 1`), pushed per draw call via `push_globals`. The
vertex shader branches on this uniform to select the tessellated or SDF code path.
- **Tessellated mode** (`mode = 0`): direct vertex buffer with explicit geometry. Used for text
(SDL_ttf atlas sampling), triangles, triangle fans/strips, single-pixel points, and any
user-provided raw vertex geometry.
- **SDF mode** (`mode = 1`): shared unit-quad vertex buffer + GPU storage buffer of
`Core_2D_Primitive` structs, drawn instanced. Used for all shapes with closed-form signed distance
functions.
Both modes use the same fragment shader. The fragment shader checks `Shape_Kind` (low byte of
`Core_2D_Primitive.flags`): kind 0 (`Solid`) is the tessellated path, which premultiplies the texture
sample and computes `out = color * t`; kinds 1–4 dispatch to one of four SDF functions (RRect, NGon,
Ellipse, Ring_Arc) and apply gradient/texture/outline/solid color based on `Shape_Flags` bits.
#### Why SDF for shapes
CPU-side adaptive tessellation for curved shapes (the current approach) has three problems:
1. **Vertex bandwidth.** A rounded rectangle with four corner arcs produces ~250 vertices × 20 bytes
= 5 KB. An SDF rounded rectangle is one `Core_2D_Primitive` struct (96 bytes) plus 4 shared
unit-quad vertices. That is roughly a 50× reduction per shape.
2. **Quality.** Tessellated curves are piecewise-linear approximations. At high DPI or under
animation/zoom, faceting is visible at any practical segment count. SDF evaluation produces
@@ -412,60 +486,55 @@ SDF primitives are submitted via a GPU storage buffer indexed by `gl_InstanceInd
shader, rather than encoding per-primitive data redundantly in vertex attributes. This follows the
pattern used by both Zed GPUI and vger-rs.
Each SDF shape is described by a single `Core_2D_Primitive` struct (96 bytes) in the storage
buffer. The vertex shader reads `primitives[gl_InstanceIndex]`, computes the quad corner position
from the unit vertex and the primitive's bounds, and passes shape parameters to the fragment shader
via `flat` interpolated varyings.
Compared to encoding per-primitive data in vertex attributes (the "fat vertex" approach), storage-
buffer instancing eliminates the 4–6× data duplication across quad corners. A rounded rectangle costs
96 bytes instead of 4 vertices × 60+ bytes = 240+ bytes.
The tessellated path retains the existing direct vertex buffer layout (20 bytes/vertex, no storage
buffer access). The vertex shader branch on `mode` (push constant) is warp-uniform — every invocation
in a draw call has the same mode — so it is effectively free on all modern GPUs.
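The per-shape byte counts from the last few paragraphs, collected in one place as checkable constants (all figures from the text; the 60-byte fat vertex is the cited worst case):

```c
/* Per-rounded-rectangle upload cost under each submission strategy. */
enum {
    TESSELLATED_RRECT_BYTES = 250 * 20,  /* ~250 vertices x 20 bytes/vertex */
    SDF_RRECT_BYTES         = 96,        /* one Core_2D_Primitive */
    FAT_VERTEX_RRECT_BYTES  = 4 * 60,    /* 4 quad corners x 60+ bytes each */
};
```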
#### Shape kinds and SDF dispatch
The fragment shader dispatches on `Shape_Kind` (low byte of `Core_2D_Primitive.flags`) to evaluate
one of four signed distance functions. The `Shape_Kind` enum, per-kind `*_Params` structs, and
CPU-side drawing procs all live in `core_2d.odin`. The drawing procs build the appropriate
`Core_2D_Primitive` and set the kind automatically:
Each user-facing shape proc accepts a `Brush` union (color, linear gradient, radial gradient,
or textured fill) as its fill source, plus optional outline parameters. The procs map to SDF
kinds as follows:
| User-facing proc | Shape_Kind | SDF function | Notes |
| -------------------- | ---------- | ------------------ | ---------------------------------------------------------- |
| `rectangle` | `RRect` | `sdRoundedBox` | Per-corner radii from `radii` param |
| `circle` | `RRect` | `sdRoundedBox` | Uniform radii = half-size (circle is a degenerate RRect) |
| `line`, `line_strip` | `RRect` | `sdRoundedBox` | Rotated capsule — stadium shape (radii = half-thickness) |
| `ellipse` | `Ellipse` | `sdEllipseApprox` | Approximate ellipse SDF (fast, suitable for UI) |
| `polygon` | `NGon` | `sdRegularPolygon` | Regular N-sided polygon inscribed in a circle |
| `ring` (full) | `Ring_Arc` | Annular radial SDF | `max(inner - r, r - outer)` with no angular clipping |
| `ring` (partial arc) | `Ring_Arc` | Annular radial SDF | Pre-computed edge normals for angular wedge mask |
| `ring` (pie slice) | `Ring_Arc` | Annular radial SDF | `inner_radius = 0`, angular clipping via `start/end_angle` |
The `Shape_Flags` bit set controls per-primitive rendering mode (outline, gradient, texture, rotation,
arc geometry). See the `Shape_Flag` enum in `core_2d.odin` for the authoritative flag
definitions and bit assignments.
**What stays tessellated:**
- Text (SDL_ttf atlas, pending future MSDF evaluation)
- `tess.pixel` (single-pixel points)
- `tess.triangle`, `tess.triangle_aa`, `tess.triangle_lines` (single triangles)
- `tess.triangle_fan`, `tess.triangle_strip` (arbitrary user-provided geometry)
- Any raw vertex geometry submitted via `prepare_shape`
The design rule: if the shape has a closed-form SDF, it goes through the SDF path with its own
`Shape_Kind`. If it is described by a vertex list or has no practical SDF, it stays tessellated.
### Effects pipeline
### Backdrop pipeline
The backdrop pipeline handles effects that sample the current render target as input: frosted glass,
refraction, mirror surfaces. It is separated from the main and effects pipelines for a structural
reason, not register pressure.
**Render-pass boundary.** Before any backdrop-sampling fragment can run, the current render target
must be in a sampler-readable state. A draw call that samples the render target it is also writing
to is a hard GPU constraint; the only way to satisfy it is to end the current render pass and start
a new one. That render-pass boundary is what a “bracket” is.
**Multi-pass implementation.** Backdrop effects are implemented as separable multi-pass sequences
(downsample → horizontal blur → vertical blur → composite), following the standard approach used
by iOS `UIVisualEffectView`, Android `RenderEffect`, and Flutter's `BackdropFilter`. Each individual
sub-pass is budgeted at **≤24 registers** (same as the main pipeline — full Valhall occupancy). The
multi-pass approach avoids the monolithic 70+ register shader that a single-pass Gaussian blur would
require, keeping each sub-pass well under the 32-register cliff.
**Render-target choice.** When any layer in the frame contains a backdrop draw, the entire
frame renders into `source_texture` (a full-resolution single-sample texture owned by the
backdrop pipeline) instead of directly into the swapchain. At the end of the frame,
`source_texture` is copied to the swapchain via a single `CopyGPUTextureToTexture` call.
This means the bracket has no mid-frame texture copy: by the time the bracket runs,
`source_texture` already contains the pre-bracket frame contents and is the natural sampler
input. When no layer in the frame has a backdrop draw, the existing fast path runs: the frame
renders directly to the swapchain and the backdrop pipeline's working textures are never
touched. Zero cost for backdrop-free frames.
**Why not split the backdrop sub-passes into separate pipelines?** Each sub-pass is budgeted at ≤24
registers, well under Valhall's 32-register cliff, so there is no occupancy motivation for splitting.
The sub-passes also have no common-vs-uncommon distinction — if backdrop effects are active, every
sub-pass runs; if not, none run. The backdrop pipeline either executes as a complete unit or not at
all. Additionally, backdrop effects cover a small fraction of the frame's total fragments (~5% at
typical UI scales), so even if a sub-pass did cross a cliff, the occupancy variation within the
bracket would have negligible impact on frame time.
#### Bracket scheduling model
The bracket is scheduled per layer, anchored at the first backdrop sub-batch in the layer's
submission order. Concretely, a layer with one or more backdrops splits into three groups:
1. **Pass A (pre-bracket)** — every non-backdrop sub-batch with index `< bracket_start_index`.
Renders to `source_texture` in a single render pass.
2. **The bracket** — every backdrop sub-batch in the layer (regardless of index). Runs one
downsample pass, then one (H-blur + V-composite) pass pair per unique sigma.
3. **Pass B (post-bracket)** — every non-backdrop sub-batch with index `>= bracket_start_index`.
Renders to `source_texture` with `LOAD`, drawing on top of the composited backdrop output.
`bracket_start_index` is the absolute index of the first `.Backdrop` kind in the layer's sub-batch
range. If the layer has no backdrops, none of this kicks in and the layer renders in a single render
pass via the existing fast path.
**Per-sigma-group execution.** The bracket walks each layer's sub-batches and groups contiguous
`.Backdrop` sub-batches that share a sigma; each group picks its own downsample factor (1, 2, or 4)
based on `compute_backdrop_downsample_factor`. For each group it runs four sub-passes: a downsample
from `source_texture` to `downsample_texture`; an H-blur from `downsample_texture` to
`h_blur_texture`; a V-blur from `h_blur_texture` back into `downsample_texture` (ping-pong reuse);
and finally a composite that reads the fully-blurred `downsample_texture`, applies the SDF mask
and tint, and writes the result to `source_texture`. Sub-batch coalescing in
`append_or_extend_sub_batch` merges contiguous same-sigma backdrops into a single instanced
composite draw; non-contiguous same-sigma backdrops still share the blur output but issue separate
composite draws.
The working textures are sized at the full swapchain resolution; larger downsample factors only
fill a sub-rect via viewport-limited rendering (see the comment block at the top of `backdrop.odin`
for the factor-selection table and rationale).
#### Submission-order trade-off
Within Pass A and Pass B, sub-batches render in the user's submission order. What the bracket model
sacrifices is _interleaved_ ordering between backdrop and non-backdrop content within a single
layer. A non-backdrop sub-batch submitted between two backdrops still renders in Pass B (after the
bracket), not at its submission position. Worked example:
```
draw.rectangle(layer, bg, GRAY) // 0 Tessellated → Pass A
draw.rectangle(layer, card_blue, BLUE) // 1 SDF → Pass A
draw.gaussian_blur(layer, panelA, sigma=12) // 2 Backdrop → Bracket (sees: bg + blue card)
draw.rectangle(layer, card_red, RED) // 3 SDF → Pass B (drawn ON TOP of panelA)
draw.gaussian_blur(layer, panelB, sigma=12) // 4 Backdrop → Bracket (sees: bg + blue card; same as panelA)
draw.text(layer, "label", ...) // 5 Text → Pass B (drawn ON TOP of both panels)
```
In this layer, panelB does _not_ see card_red — even though card_red was submitted before panelB —
because both backdrops sample `source_texture` as it stood at the bracket entry, which is after
Pass A and before card_red has rendered. card_red ends up on top of panelA, not underneath it.
The user controls the alternative outcome by splitting layers. Putting card_red and panelB into a
new layer (via `draw.new_layer`) gives panelB a fresh source snapshot that includes panelA and
card_red:
```
base := draw.begin(...)
draw.rectangle(base, bg, GRAY)
draw.rectangle(base, card_blue, BLUE)
draw.gaussian_blur(base, panelA, sigma=12) // panelA in base layer's bracket
top := draw.new_layer(base, ...)
draw.rectangle(top, card_red, RED)
draw.gaussian_blur(top, panelB, sigma=12) // top layer's bracket; sees base + card_red
draw.text(top, "label", ...)
```
Why one bracket per layer and not one per backdrop? Each bracket adds three render passes
(downsample + H-blur + V-composite) and at least three tile-cache flushes on tilers like Mali
Valhall. Strict submission-order semantics would require one bracket per cluster of contiguous
backdrops, which scales the GPU cost linearly with how interleaved the user's submission happens
to be — a footgun. The current design caps the bracket cost per layer regardless of submission
interleave, and gives the user explicit control over ordering through the existing layer
abstraction. This matches the cost/complexity envelope of iOS `UIVisualEffectView` and CSS
`backdrop-filter` (both of which constrain backdrop ordering implicitly).
### Vertex layout
The vertex struct is unchanged from the current 20-byte layout:
```
Vertex_2D :: struct {
position: [2]f32, // 0: screen-space position
uv: [2]f32, // 8: atlas UV (text) or unused (shapes)
color: Color, // 16: u8x4, GPU-normalized to float
}
```

For tessellated draws, `position` carries actual world-space geometry. For SDF draws, `position` holds unit-quad
corners (0,0 to 1,1) and the vertex shader computes world-space position from the storage-buffer
primitive's bounds.
The `Core_2D_Primitive` struct for SDF shapes lives in the storage buffer, not in vertex attributes:
```
Core_2D_Primitive :: struct {
bounds: [4]f32, // 0: min_x, min_y, max_x, max_y
color: Color, // 16: u8x4, unpacked in shader via unpackUnorm4x8
flags: u32, // 20: low byte = Shape_Kind, bits 8+ = Shape_Flags
rotation_sc: u32, // 24: packed f16 pair (sin, cos). Requires .Rotated flag.
_pad: f32, // 28: reserved for future use
params: Shape_Params, // 32: per-kind params union (half_feather, radii, etc.) (32 bytes)
uv_rect: [4]f32, // 64: texture UV coordinates. Read when .Textured.
effects: Gradient_Outline, // 80: gradient and/or outline parameters (16 bytes).
}
// Total: 96 bytes (std430 aligned)
```
`Shape_Params` is a `#raw_union` over `RRect_Params`, `NGon_Params`, `Ellipse_Params`, and
`Ring_Arc_Params` (plus a `raw: [8]f32` view), defined in `core_2d.odin`. Each SDF kind
writes its own params variant; the fragment shader reads the appropriate fields based on `Shape_Kind`.
`Gradient_Outline` is a 16-byte struct containing `gradient_color: Color`, `outline_color: Color`,
`gradient_dir_sc: u32` (packed f16 cos/sin pair), and `outline_packed: u32` (packed f16 outline
width). It is independent of `uv_rect`, so a primitive can carry texture and outline parameters at
the same time. The `flags` field encodes the `Shape_Kind` in the low byte and `Shape_Flags` in bits
8+ via `pack_kind_flags`.
### Draw submission order
pair into bitmap atlases and emits indexed triangle data via `GetGPUTextDrawData`. Text rendering is
**unchanged** by the SDF migration — text continues to flow through the main pipeline's tessellated
mode with `mode = 0`, sampling the SDL_ttf atlas texture.
MSDF (multi-channel signed distance field) text rendering may be evaluated later, which would
allow resolution-independent glyph rendering from a single small atlas per font. This would involve:
- Offline atlas generation via Chlumský's msdf-atlas-gen tool.
- Runtime glyph metrics via `vendor:stb/truetype` (already in the Odin distribution).
- A new MSDF glyph `Shape_Kind` in the fragment shader (additive — the kind dispatch infrastructure
already exists for the four current SDF kinds).
- Potential removal of the SDL_ttf dependency.
This is explicitly deferred.
**References:**
### Textures
Textures plug into the existing main pipeline — no additional GPU pipeline, no shader rewrite. The
work is a resource layer (registration, upload, sampling, lifecycle) plus a `Texture_Fill` Brush
variant that routes the existing shape procs through the SDF path with the `.Textured` flag set.
#### Why draw owns registered textures
#### Textured draw procs
Textures share the same shape procs as colors and gradients. Each shape proc takes a `Brush`
union as its fill source; passing a `Texture_Fill` value (carrying `Texture_Id`, `tint`,
`uv_rect`, and `Sampler_Preset`) routes the draw through the SDF path with the `.Textured`
flag set. There is no dedicated `rectangle_texture` / `circle_texture` proc — the same
`rectangle`, `circle`, `ellipse`, `polygon`, `ring`, `line`, and `line_strip` procs handle
all fill sources.
A separate tessellated proc for "simple" fullscreen quads was considered on the theory that
the tessellated path's lower register count would improve occupancy at large fragment counts.
Both paths are well within the ≤24-register main pipeline budget — both run at full
occupancy on every target architecture (Valhall and above). The remaining ALU difference
(~15 extra instructions for the SDF evaluation) amounts to ~20μs at 4K — below noise.
Meanwhile, splitting into a separate pipeline would add ~15μs per pipeline bind on the CPU
side per scissor, matching or exceeding the GPU-side savings. Within the main pipeline,
unified remains strictly better.
SDF drawing procs live in the `draw` package with unprefixed names (`rectangle`, `circle`,
`ellipse`, `polygon`, `ring`, `line`, `line_strip`). Gradients, textures, and outlines are
selected via the `Brush` union and optional outline parameters rather than separate overloads.
#### What SDF anti-aliasing does and does not do for textured draws
The SDF path anti-aliases the **shape's outer silhouette** — rounded-corner edges, rotated edges,
outline edges. It does not anti-alias or sharpen the texture content. Inside the shape, fragments
sample through the chosen `Sampler_Preset`, and image quality is whatever the sampler produces from
the source texels. A low-resolution texture displayed at a large size shows bilinear blur regardless
of which draw proc is used. This matches the current text-rendering model, where glyph sharpness
depends on how closely the display size matches the SDL_ttf atlas's rasterized size.
#### Fit modes are a computation layer, not a renderer concept
Standard image-fit behaviors (stretch, fill/cover, fit/contain, tile, center) are expressed as UV
sub-region computations on top of the `uv_rect` field of `Texture_Fill`. The renderer has no
knowledge of fit modes — it samples whatever UV region it is given.
A `fit_params` helper computes the appropriate `uv_rect`, sampler preset, and (for letterbox/fit
mode) shrunken inner rect from a `Fit_Mode` enum, the target rect, and the texture's pixel size.
Clay's `RenderCommandType.Image` is handled by dereferencing `imageData: rawptr` as a pointer to a
`Clay_Image_Data` struct containing a `Texture_Id`, `Fit_Mode`, and tint color. Routing mirrors the
existing rectangle handling: `fit_params` computes UVs from the fit mode, then `rectangle` is
called with a `Texture_Fill` brush and the appropriate radii (zero for sharp corners, per-corner
values from Clay's `cornerRadius` otherwise).
#### Deferred features
The following are plumbed in `Texture_Desc` but not yet implemented:
- **Mipmaps**: `Texture_Desc.mip_levels` field exists; generation via SDL3 deferred.
- **Compressed formats**: `Texture_Desc.format` accepts BC/ASTC; upload path deferred.
- **3D textures, arrays, cube maps**: `Texture_Desc.type` and `depth_or_layers` fields exist.
- **Additional samplers**: anisotropic, trilinear, clamp-to-border — additive enum values.
- **Atlas packing**: internal optimization for sub-batch coalescing; invisible to callers.
**References:**